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DEVELOPMENT OF ADVANCED ACREAGE ESTIMATION METHODS 
1 . INTRODUCTION 

A practical application of remote sensing which is of considerable 
interest is the use of satellite-acquired (LANDSAT) multispectral scanner 
(MSS) data to conduct an inventory of some crop of economic interest such 
as wheat over a large geographical area. Any such inventory requires the 
development of accurate and efficient algorithms for analyzing the struc- 
ture of the data. The use of multi-images (several registered passes over 
the same area during the growing season) increases the dimension of the 
measurement space. As a result, characterization of the data structure 
is a formidable task for an unaided analyst. 

Cluster analysis has been used extensively as a scientific tool to 
generate hypotheses about structure of data sets. Sometimes one can 
reduce a large data set to a relatively small data set by the appropriate 
grouping of elements using cluster analysis. In some cases, the algorithm 
which effects the grouping becomes the basis for actual classification. 

In other cases, the cluster analysis produces groupings of the data which 
in turn serve as a starting point for other algorithms which produce 
acreage estimates. Additional uses of cluster analysis arise in conjunction 
with dimensionality reduction techniques which are used to generate displays 
for purposes of further interactive analysis of the data structure. 

Work carried out under this contract dealt with algorithm development, 
theoretical investigations, and empirical studies. The algorithm development 
tasks centered around the use of the AMOEBA clustering/classification 
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algorithm as a basis for both a color display generation technique and 
maximum likelihood proportion estimation procedure. Theoretical results 
were obtained which form a basis for the maximum lieklihood estimation 
procedures. An approach to analyzing large data reduction systems was 
formulated. An exploratory empirical study of spatial correlation in 
LANDSAT data was also carried out. Specifically, investigations were 
carried out in the following areas: 

Development of Multi-Image Color Images 
Spectral-Spatial Classification Algorithm Development 
Spatial Correlation Studies 
Evaluation of Data Reduction Systems 
Each of these investigations is discussed in turn in the sequel. 
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2. DEVELOPMENT OF MULTI-IMAGE COLOR IMAGES 

In a crop inventory application, the input data for a clustering 
algorithm is a multi-image ; namely, a set of registered images, taken at 
different times, of the same subject. In addition to having multi- 
dimensional data (multi spectral measurements) we also have "multi -pictures" 
of the subject. The availability of this spatial aspect of the data and 
attempts to preserve the spatial integrity were the basis for investigations 
carried out in previous contract periods (see [1] and the references there- 
in). These investigations led to the development of the AMOEBA spatial 
clustering/classification algorithm ([2]) and a distance preserving 
algorithm for dimensionality reduction ([3]). 

The above mentioned algorithms were combined with a model for human 
color vision to formulate a technique for generating a single color image 
from a multi-image. The formulation and results of the technique are 
presented in the attached report: 

Jack Bryant and Gary Breaux, Multi- Image Display for Human Under- 
standing, Contract NAS-9-14689, SR-Tl -04080, Report #22, Department 
of Mathematics, Texas A&M University, August, 1980. 
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3. SPECTRAL-SPATIAL CLASSIFICATION ALGORITHM DEVELOPMENT 

The objective of this study was to formulate and test algorithms 
based on a likelihood function which respected the integrity of some 
predetermined structure in the data. 

For purposes of these investigations, the "pure field data" (patches) 
determined by the AMOEBA algorithm ([2]) were used as the predetermined 
structure. A maximum likelihood parameter estimation procedure (HISSE) 
was designed to respect (take into account) field integrity. 

A mathematical description and implementation of the procedure, along 
with results from preliminary tests appears in the attached report: 

Charles Peters and Frank Kampe, Numerical trials of HISSE, 

Contract NAS-9-14689, SR-H0-00477, Department of Mathematics, 
University of Houston, August, 1980. 

Theoretical results underlying the approach used in the HISSE 
algorithm are discussed in the attached report: 

Charles Peters, On the existence, uniqueness, and asymptotic 
normality of a consistent solution of the likelihood equations 
for nonidentical ly distributed observations — applications to 
missing data problems. Contract NAS-9-14689, SR-H0-00492, 

Department of Mathematics, University of Houston, September, 

1980. 


Additional theoretical results were obtained which address the con- 
vergence of a particular iterative form of the likelihood equations in 
the case of a mixture of densities from (possibly distinct) exponential 
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families. These results appear in the attached report: 

Richard A. Redner, An iterative procedure for obtaining maximum 
likelihood estimates in a mixture model. Contract NAS-9-14689, 
SR-Tl -04081, Division of Mathematical Sciences, University of 
Tulsa, September, 1980. 
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4. SPATIAL CORRELATION STUDIES 

The objective of this study was to gain some insight into the nature 
of the spatial correlation of pixels in Landsat data. In particular, an 
empirical study of neighboring pixels (along scan lines) was carried out in 
an attempt to understand the characteristics of spatial correlation for 
boundary or mixed pixels. Results of this study appear in the attached 
report: 

W. A. Coberly, Spatial correlation in LANDSAT: An empirical 

study. Contract NAS-9-14689, SR-Tl -04082, Division of Mathematical 
Sciences, University of Tulsa, November, 1980. 
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5. EVALUATION OF DATA REDUCTION SYSTEMS 

Data reduction systems which utilize multi -temporal MSS data to 
produce proportion estimates of several crop classes are large and com- 
plicated. Large numbers of vector-valued observations are used, in con- 
junction with algorithms based on various models, to produce these 
estimates. Testing the validity of these models and determining the 
subsequent effect on the accuracy of the proportion estimates cannot 
(in many instances) be carried out. In addition, when the software 
system is (conceptually) the best it may be that properties of the original 
data set in fact impose the accuracy limitations. 

A theoretical approach to determining the limiting accuracy of the 
data set is set forth in the report: 

Virgil R. Marco, Jr. and P. L. Odell, Information in remotely 
sensed data for estimating proportions in mixture densities. 

Contract NAS-9-14689, SR-Tl-04083, Program in Mathematical 
Sciences, University of Texas at Dallas, November, 1980. 
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Abstract . Three recently discovered techniques are combined to produce 
subjectively appealing color displays of multi -temporal Landsat imagery. 

The first technique selects prototypes by use of an unsupervised clustering 
program. These are used to find a linear dimensionality reduction such that 
the inter-prototype separation in the original space is nearly preserved 
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values for an image in which normal human interpretation of color differences 
closely matches the Euclidean distances within the three dimensional pre- 
image. 

Clustering Linear feature selection Landsat 
Color display Human vision Multi-imagery 


*The authors were partly supported by the National Aeronautics and Space 
Administration, Contract NAS-9-14689, principal investigator, L. F. Guseman, Jr. 



1 


Consider the imagery shown in Fig. 1. Each scene of about 23,000 
picture elements (pixels) is a Landsat remotely-sensed image taken from 
the North American Great Plains. The images have been corrected geo- 
metrically to be in close spatial registration to one another. Each was 
acquired on a different date: in May, June, August, and September, 1976. 

The August acquisition is shown in Plate lA, the standard false-color 
product produced at Johnson Space Center, Houston, Texas. The two Landsat 
infra-red bands have no color; the standard product is somewhat like 
color infra-red film. The images of Fig. 1 are small, but the digital 
data set is not, for each pixel is a 16-vector (4 components for each 
acquisition) . 

The high dimensionality of the space in which these data are 
embedded is a common problem in pattern recognition. Most data analysis 
techniques such as clustering or classification require computer time 
at least in proportion to the dimension, and some (e.g. maximum likeli- 
hood classification) require time porportional to the square. Thus a 
common motive for dimensionality reduction is computational complexity. 
Another is human understanding: the presentation of the multi -image in 

the form of Fig. 1 (as four images) is not ideal. Yet there seems to 
exist no better way to present high dimensional imagery for human analysis. 
This is exactly the problem we tackle: is there a way to display the 

imagery of Fig. 1 while retaining the spatial and spectral -temporal 


structure? 
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Fig. 1 Four Pass Landsat Imagery 


i 
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Plate 1. Color Products: A. 

B. 

C. 

D. 


JSC Product 1 

AMOEBA Clustering of Fig. 1 
Principle Components Display 
Distance Preserving Display 
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WHAT IS STRUCTURE? 

By spatial structure we mean the spatial relationship between 

objects in the scene. To preserve spatial structure we produce a single 

image which is pixel -by-pixel registered to the multi -imagery. It is 

not so clear what spectral -temporal structure means. It will surely 

mean different things to different people. Our view is that the structure 

is represented by the Euclidean distances (in the high dimensional space) 

between typical measurement-space samples. Structure is preserved when 

these distances are accurately reproduced in the lower dimensional space. 

A new technique^^^ for linear feature selection has as its objective 

the preservation of distances between samples (prototypes). Rather than 

obtain the prototypes at random, we use the spatial clustering program 
(21 

AMOEBA.' ' Plate IB shows the clustering of the data in Fig. 1 we 

obtain. Note that this cluster map is not an image in the usual sense 

of a picture of a scene. Some of the spatial structure is clearly lost, 

particularly the pattern of roads so easily seen in Fig. 1. 

Because of the spectral overlap between the measurements in any 

one acquisition (and present in the scene), the intrinsic dimensionality 

(31 

of a given acquisition is less than the number of measurements.' ‘ Thus 

we know some of the spectral structure, and use a four-to-two brightness- 

(41 

greenness transformation. ' ' This converts the 16-dimensional data of 
Fig. 1 to 8-dimensional data. Jhis is the data we cluster to produce 
Plate IB. 
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WHAT IS COLOR PERCEPTION? 

A method for reducing dimensionality (and a measure of success) 
is only helpful is we can display the reduced data so it can be understood. 

As an example, suppose the data could be represented in one dimension. 

Then it is natural to produce a gray-scale or black-and-white image. 

Since we know that normal human gray {i.e. non-color) vision has a 
logarithmic response, we prepare an image so that the perceived 
brightness (not the actual brightness) is linearly proportional to the 
transformed data (with, perhaps, a bias to translate the transformed data). 
That is, we consider the physiology of human vision in preparing our image. 

Unfortunately, the multi-imagery of Fig, 1 is not one dimensional 
spectrally: nor is any single acquisition. As we shall see, however, the 
data can be reduced to three dimensions with small errors. Color images 
can be produced with three colors, which suggests color vision is at most 
three dimensional. The easy way to get a color dispaly (reduce dimensionality 
to three, display one red, one green, and one blue) is not appropriate for 
the same reason that we would have been wrong to produce a black-and- 
white image with the flux viewed linearly proportional to the transformed 
data. Namely, this display fails to take into account the physiology of 
human color vision. Indeed, imagery produced in this way is disappointing 
(Hay et al.' '). Instead, we should produce a color image in which 
human perception of color difference matches distances between the objects 
being displayed. To this end, we need to model visual perception. We 
begin with a red-green-blue digital image and follow the processing of 
this image by the visual system. We use the notation of Faugheras, 
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A model for the combined video or photographic system and pigmented 
cone photochemical response gives a linear transformation U to produce 
cone output signals L, M, and S. A model for retinal receptor response 
produces the (nonlinear) transformation by the logarithm function to 
L*, M*, and S*. Next a model for the Ganglion neural connections gives 
a final linear transformation P to signals A, , and C 2 - Signal A is 
brightness and and C 2 are chromaticity signals: these go to the 

visual cortex. (We are ignoring spatial effects.) Faugheras notices 
that each of these transformations is invertible and uses this to trans- 
mit color imagery over a noisy channel with lower bit rate (or better 
perceived signal -to-noise ratio). He reports^^’ ^ a reduction in 
the average bit-rate by a factor of 27. 

A comprehensive survey of color image perception and a bibliographical 

guide is found in Hall^^’ Hall gives a block diagram (p. 42) 

of the monocular visual system (but gives no numeric parameters). 

Faugheras 's work is based on a slightly simpler model (for light-adapted 

(or photoptic) vision). To use his work, one need only determine U. 

He has determined P by psychovisual experiments. There is another approach 

to this problem, outlined by Hall^^’ and followed by Juday^®^ 

(9) 

and Kaneko' . We prefer the approach based on a model, although we do 
not know the exact U for the film product used. This problem is being 
studied, but our requirements are not severe: we do not need strict 

color fidelity. The major problems left are: first, how much of sub- 

jective color space can we occupy without exceeding the film color gamut? 
Second, how do we scale the output image so that it can be displayed on 
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a given digital system? We found experimentally that twenty-five levels 
of brightness A and thirteen levels of each chromaticity channel C-j and 
C2 could be displayed. The details of how to scale everything are less 
interesting and are relegated to the Apoendix. 

Let's now review the end-to-end process. We obtain our connection 
between measurement space and perception space by the following steps: 

1. Using feature selection techniques/^^’^^ ^ reduce the 
dimensionality to three. We use here the principle components 
map and the distance preserving map.^^^ 

2. Apply suitable scaling (see the Appendix) and apply P”\ 
exponential, and U"^ to the transformed image. 

3. Again scale, and display the result on a color monitor or as 
color film. These products make up Plate 1C (the principle 
components map) and Plate ID (the distance preserving map). 

DISCUSSION 

Observers, viewing Plate 1, uniformly prefer the color image ID. 

The cluster map IB is rejected because it is not a picture in the same 
sense that lA, 1C, and ID are pictures, although the clustering shown 
might be a helpful aid to a human analyst. Plate 1C is not favored 
because obviously distinct classes are colored the same. This is cer- 
tainly not the case in ID. We observe that 1C is "too dark," yet it was 
produced by the same method as led to ID; only the feature selection 
method was different. This finding which discredits the principle com- 
ponents approach is new but not entirely unexpected. See, for example, the 



8 


imagery shown in Lowitz.^^^’ The seventh (of seven) 

principle components image contains significant structural information. 
Here we find that the principle components map from 8 to 3 dimensions 
identifies distinct classes, a flaw which goes against our underlying 
purpose. If B IS 3 X 8 matrix and y^,...,yp are the prototypes, let 
P = p(p-l)/2, let 

f(B) = I (l|By,-ByJl - ||y,-y,||)^ 
and let 

N(B) = f(b))^/^ . 

For the principle components map B, N(B) = 9.78, and for the distance 
preserving map N(B) = 0.95. The two are shown in Table 1. 

The main open problem is to make the colors reproducible. The 
experiment reported here used 32 prototypes. In another, using the same 
data and procedure, we let AMOEBA find the natural number of clusters 
rather than the forced number 32. It found 12, and their centers were 
used as prototypes. The resulting image was as satisfactory as ID, but 
red and green were interchanged. Clearly the process does not lead to 
stable color assignments in any absolute sense. Another problem: should 

the spatial aspects of color vision be taken into account? We suspect 
not if one is to view the composite as an image. Image enhancement by 
spatial filtering is another matter. The three perception space channels 
A, C-| , and C2 have different modulation transfer functions. 
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Table 1. 


Principle Components Transformation 


-.6454 

-.2910 

.0362 

.0120 

.3973 

-.2734 

-.4939 

.2356 

.1264 

-.0406 

.0470 

-.2396 

.2934 

-.8290 

.4714 

.3878 

.0495 

-.1812 

.7280 

-.1414 

-.1366 

Transformation which 

Minimizes 

f 




-.4441 

-.2485 

-.0040 

-.5235 

.5668 

-.1261 

-.5492 

.2721 

.1634 

-.1447 

-.2517 

-.6681 

.0082 

-.7080 

.2802 

.2787 

.1073 

.7353 

.1515 

-.3301 

-.5169 


-.1442 

.3065 

-.1530 


-.4266 

.3029 

-.2412 
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The underlying psychovisual experimentation is incomplete in that the 
interaction of perception and filtering A, , and C2 differently has not 
been resolved. Is linear filtering {as by spatial convolution) even the 
appropriate operation in perception space? Results we have obtained so 
far with image enhancement in perception space have been disappointing. 

One sees, on viewing Plate ID, that no saturated red is present. 

This results from our avoidance of the boundary of the color gamut. It 
is safe, but does leave many displayable colors unused. Can these colors 
be used without identifying classes which must be projected onto the 
boundary of the gamut to be displayed? 

SUMMARY 

Linear feature selection and a model for human color vision are 
combined to obtain a connection between multi -imagery and the human 
visual system. The overall objective is to preserve the spatial 
structure of the data as a single image, with perceived color separation 
matching multi-dimensional Euclidean separation in the original measure- 
ment space. The principle components feature selection technique is 
found to fail to separate classes obviously separated in the original 
data. A new distance-preserving linear map is tested and is found to 
accurately represent the measurement-space structure of the data. Color 
products are reproduced to illustrate the results. Several open problems 
are mentioned. An appendix giving all key details of the method is 
included. 
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APPENDIX 

Let the prototypes selected by AMOEBA (or by some other method) 
be denoted by y^,...,yp. Let A be a linear feature selection matrix 
to three dimensions, and let = Ay^ . The transformed prototypes 
preserve some aspect of the data structure in lower dimensional space, 
depending, of course, on the feature selection technique. Let be 
the mean vector of the transformed prototypes, and let z^. = x^. - X|^. 

We first determine a scale factor Sp for the prototypes. For any Sp, 
let w. = s Z-. Determine s so that each w. is in the parallelepiped 

ipi p 1 

[-12,12] X [-6,6] X [-6,6], and at least one w^. is on a face of this 
parallel piped. Let S = SpP~\ where P is the transformation determined 
by psychovisual experiments.^®^ Let u^.j = exp(w^.j), i = l,...,p and 
j = 1,2,3. (We use the second subscript to indicate the j-th component 
of the vector u^-.) Let v^. = U ^u^. Usually v^ would now be translated 
and scaled to fit the range of the display device. The imaging system 
we use*, however, makes transmission density linearly proportional 
to input rather than to the logarithm,^®’ so we compute 

t.. = log V.., j = 1,2,3. Now determine a scale factor s^ and a display 
bias b such that if d^^ = Sp t^.^ + b then each d^^ is in [0,255] and 

at least one d has the value 0 and another has the value 255. 

^ J 

*The Information International FR-80 at Johnson Space Center, 
Houston, Texas. The machine gives transmission density linearly 
proportional to input in a channel with zero input on the other two 
channels. Transmission density is the logarithm of the ratio of the 
transmitted flux with and without the sample's presence in the light 
beam. 



We are now prepared to define the transformation by which all 

data (not just the prototypes) is mapped to perception space. Let 

E • -> be defined by Ep . = exp(p ) , j = 1,2,3. Let 

J J 

+ 33 + 

d = exp(-b/Sp) and define L : E ->■ E by L p = log p. if p. > d 
L^ Pj “ ^ M : E^ E^ be defined by 

M(P,) = [mini p ,255}] , j = 1,2,3. The transformation from input 

J J 

multi -imagery I to gun values G is 
G = M(Sp L‘*'U'^ES(AI-Xn^)+b). 
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by 
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1 . Introduction . 

The Houston Integrated Spatial/Spectral Estimator (HISSE) is a statistical 
estimation procedure based on a normal mixture model which is designed to take 
advantage of spatial associations of LANDSAT data pixels produced by an auto- 
mated spatial/spectral clustering algorithm. The clustering algorithm used in 
this experiment is the AMOEBA algorithm developed at Texas A & M University, 
which is based on the three assumptions listed below [1]. AMOEBA detects 
spatially connected sets of LANDSAT pixels, called patches , whose elements 
are characterized by spectral similarity, within certian tolerances, to their 
neighbors. 

Assumption 1 : Real classes exist. 

Assumption 2 : Each patch contains pixels from one and only one 

real class. 

Assumption 3 Each real class is represented by at least one patch 

No absolute commitment to the agricultural nature of real classes is 
expressed in Lll, however, there is an indication of a high degree of purity 
of patches with respect to ground truth labels when AMOEBA patches are plotted 
on ground truth maps. A more complete study, with the same conclusion, is 
reported in [5J. Therefore, we feel justified in identifying the real classes 
with ground truth labels. In addition to the three assumptions just given, 



HISSE requires the following assumption. 


Assumption 4 - The data from each patch is normally distributed with 
mean and covariance depending only on the class to 
which it belongs. 


Assumption 4 has been challenged, some might say refuted, in [2]. 

However, we take the position that the proper question to ask is whether 
assumption 4 is close enough to the truth to be useful in estimating class 
proportions and labeling classes with ground truth labels. The clustering 
portion of AMOEBA may be described as a k-means algorithm which respects patch 
integrity (see Assumption 2) with a novel way of determining the correct number 
of clusters. As such, it contains no way of compensating for the confusion 
arising from classes with overlapping spectral characteristics. Thus, 

Assumption 4 may be regarded as a step toward mitigating the error in proportion 
estimation which is unavoidable with the classify and count method. Henceforth, 
pixels contained in patches will be called pure pixels, and all others boundary 
pixels. 


2. Mathematical Description . 

It IS assumed that there are m real classes, labelled 1, •••, m, and p 

1 

patches represented by independent random vectors (x!j,0^), •••, (^p>0p) where 

0 f {l,---,m} IS the unknown real class to which patch j belongs and 
0 

Xj = ’ ' ■ ■ ^ of n- vectors representing the spectral data 


from the j;Ui^ patch. The 0 are i.i.d. with = ProbLO =iil unknown and, 

J 0 

given that 0 = £, X is a random sample from an n-variate normal distributi 

J J 

unknown mean and covariance. Notice that is the expected 



fraction of patches belonging to class I and for a given scene may be 
quite different from the fraction of pure pixels belonging to class Z, 
which we denote by The random variable is directly related to 

the total acreage of the patches belonging to class Z. 

The log likelihood function for the parameters is 

1) L = J^log f(Xj) 
where 

m 

2) f<Xo> “ 


and f(,(X.) is the N -fold product normal density 
^ J , J 


N. 


3) 




Despite the apparent complexity of L, it depends on the data only through 
the patch means 

N, 


4) 


1 '' 

m = ^ EX. 
J N. jk 


and scatter matrices 


5) 


Nj 


Once the m 's and 5 's 
J J 

need for the pure data. 


are computed and stored, HISSE has no further 



The numerical procedure used in HISSE for finding a maximum of the 
likelihood function is defined by iteratively substituting into the likelihood 
equations, viz. 


( 6 ) 


(k+1) _ 1 
1 ■ p J=1 


-^(xrr- 


( 7 ) 


,(k+l) 


P 

= r. N. 
J=1 J 


p 

-rndi 




( 8 ) 


Q 


(k+l) 


P 

= r 
j=l 


f (X . ) J js] J 




- M 


(k+l)„ (k+l)T 


where R. 

vj 


S + N m m 
J J J J 


is the noncentral scatter of the jWi^ patch. The values 

fl(X ) 

of the parameters used in evaluating the ratios ^re those at the preceding 

J 

k th step of the algorithm. It is shown in [6] that there is a unique strongly 
consistent solution of the likelihood equations in a neighborhood of the true 
parameters as p -> and that the iteration procedure (6)-{8) converges to the 
consistent solution if the starting values are near it. 


Let N = + 


+ Np be the total number of pure pixels. It is easy to 


show that = a and var( cj) ) < 


IN. Thus, if the patches are nearly 


""" ' 4„2 J=VJ 

uniform in size, the MLE of can be used as a predictor of (|)^. However, the 
least MSE predictor of based on the observed data (assuming that the para- 
meters are known) is 


^ , 1 P 

3, 


9 ) 



Therefore, we take evaluated with the maximum likelihood estimates of 
the parameters as our estimate of 

In processing the boundary pixels, which typically constitute 60-70% of the 
scene, we assume that the boundary data consist of an independent sample from 
a mixture 

. 11 _ 

where the component normal distributions are the same class distributions 
represented in the pure data, plus observations from a contaminant class 
(possibly corresponding to the "not in field" ground truth label) in the tails 
of the . In other words, we assume that a boundary observation 

which IS spectrally unlike all of the pure classes is much more likely to be 
from the contaminating class than an outlier from one of the pure classes. 
Therefore we classify as a contaminant each boundary observation X which 
satisfies 

11 ) " 4 

for all £. = 1, m, where the u^'s and n^^'s are the previously estimated 

2 

pure data class means and covariances and y is a size a critical value 

Aa 

2 

for X with n degrees of freedom. In this experiment we chose a = .1. 

Let Y.J, •••, denote the boundary observations remaining after rejecting 
those classified as contaminants. ' We treat Y^ , • • • , Y^^ as an independent samol 
from the mixture density (10), with unknown mixing proportions a.j , • • • , 



but known components ^nd obtain a MLE of , •••, by successively 

substituting into (6). Obviously, Y-j, •••, Y^^ is, at best, a truncated sample 
from the mixture (10), so that the MLE of a-j , • • • , is asymptotically biased. 
We do not expect this effect to be a reason for serious concern. After obtaining 
the MLE for a -j , •••, a^, we use as our final estimate of the number of pixels 
corresponding to class 2-, the quantity + Ma^, where is given by (9). 

3. Implementation . 

The number of classes assumed in this experiment is determined by AMOEBA 

subroutines PAINT and CLASFY. PAINT produces the pure/boundary division of 

a ^ X 6 mile LACIE segment, an array LABELS containing a patch description for 

each of the pure pixel locations, and a map of the scene showing the pure and 

boundary pixels. CLASFY produces an array CLASS containing the final cluster 

designation of each of the patches. A subroutine STAT2 has been attached to 

AMOEBA which calculates and saves patch sizes (N ), patch means (m ) and 

J J 

noncentral patch scatters (R-)- These statistics are then passed to STAT3 
which uses the CLASS array to compute the fraction (a°) of patches assigned 
to each cluster, the fraction of pure pixels assigned to each cluster, and cluster 
means (y°) and covariances (fip for the pure data only. These cluster 
statistics are used as initial estimates of the parameters for the iteration 
procedure described by (6)-(8). CLASFY occasionally produces a cluster with 
such a small number of pure pixels that an initial covariance estimate cannot be 
calculated. In this case the initial in HISSE is obtained by averaging 

the cluster sample covariance with a multiple of the identity so as to insure that 
the condition number of is no greater than 16. 



(\c) 

After initialization HISSE produces iterative estimates ^ 

of the parameters until a convergence criterion is satisfied, after which the 
estimates 6^ are computed in the manner described in Section 2 and stored. 

The boundary pixels are identified from the LABELS array output by AMOEBA. 

For each one, the quadratic forms (x-y^) are computed and tested 

against the threshold value of in (11). For those boundary pixels not 

rejected by the thresholding procedure, the likelihood ratios f^(x)/f|^(x) 
are computed and stored in a temporary disc file for use in the iteration 
procedure for estimating a-j , •••, a^. Although the number of boundary pixels 
processed is much greater than the number of patches, the cost is comparable to that 
of processing the pure data because the iteration procedure (6) can be carried 
out simply by accessing the temporary file. 

For the purpose of labeling classes HISSE identifies for each class I, 
the three patches j which have the highest posterior probability 

in that class. The spatial coordinates of pixels in these labeling patches 
are obtained from the LABELS array. Thus, in using HISSE, the analyst would 
be required to make a judgement concerning the identity of each class based on 
his ability to label the labeling patches. 

4. Numerical Results . 

The results tabulated in this section are from four passes over LACIE segment 
1618 acquired in May, June, August and September of 1976. The data was preprocessed 
by premultiplying each single pass 4-dimensional data vector by the LANDSAT I 
transformation to brightness-greenness space 

/ 1 1 1 o\ 






0 -1 


1 


1 



and stacking the brightness-greenness vectors to obtain 8-dimensional data 
vectors. The results of the AMOEBA run were 7500 pure pixels, organized 
into 310 patches. The number of clusters estimated by NUMCLU was 13. HISSE 
required 19 iterations to estimate the parameters of the pure data mixture 
model. Of the 15290 boundary pixels, the thresholding procedure rejected 5575. 

The number of passes through the remaining 9725 boundary pixels required to 
produce estimates of the boundary mixing proportions , •••, was 8. 

The total cost of running AMOEBA and HISSE together is much less than that of 
running UHMLE or CLASSY on the full scene. 

Figures 1-4 show the scatter plots in brightness-greenness space, correspond- 
ing to each of the passes, of the means of the patches determined by AMOEBA. 
Particularly in the fourth pass, the tasseled cap configuration described in 
[4] is visible. Figures 5, 6, and 7 show the plotted trajectories of the 
estimated class means from pass to pass on the same coordinate system used in the 
4 th pass scatter plot. The trajectories of the means of the pure data clusters 
produced by AMOEBA would be nearly indistinguishable. It is interesting that 
the class means trajectories eventually given a small grains label exhibit a 
characteristic triangular shape. Obviously, this characteristic can be used as 
an aid in labeling the classes (see 1.3!, for a discussion of this idea). 

Figure 8 tabulates the initial cluster means, cluster variances, and patch 
membership proportions obtained from AMOEBA'S clustering of the pure data. Figure 
9 tabulates class means, variances and patch memebership probabilities (the a's) 
estimated by HISSE. Figure 10 compares the estimates derived from AMOEBA and 
HISSE of the fraction of pure pixels belonging to each cluster (class). Notice 
that in Figure 10, there is a significant difference between the two estimates, 
particularly in the more populous classes. These classes happen to be the most 



spectrally confused classes. There is also an appreciable difference seen in 
Figures 8 and 9 between the respective estimates of the a's, although the 
difference is not as pronounced. 

Figure 11 shows the AMOEBA boundary map for segment 1618 with the three 
labeling patches corresponding to each class outlined. A ground truth map 
was used to attach ground truth labels to the labeling patches and hence to 
the classes. Most of the classes were given a single ground truth label by 
this procedure. Classes 2, 5, 6, 7, were not assigned a single ground truth 
label and appeared to be made up of more than one type of small grains. However, 
each of these classes was clearly small grains. Class 1 was the only really 
difficult class to label; each of its labeling patches represented small grains 
ground truth labels as well as such labels as beans and fallow. In other words, 
the labeling patches for class 1 were spurious. For the purpose of obtaining 
an aggregate small grains estimate, it was assumed that class 1 was a mixture 
of 1/3 small grains, 1/3 beans, and 1/3 fallow acreage. 

Figure 12 shows the final acreage estimate for each of the 13 classes in 
the mixture model, the acreage of the set C of boundary pixels rejected as 
outliers or contaminants, and the crop labels (including "small grains") assigned 
to each class. The aggregate small grains acreage estimate is 15,288. The 
small grains acreage from the ground truth tape is 15,465, an error of only 1.1%. 
If class 1 IS labelled all small grains, the error is 15%. If none of class 1 
is classified small grains, the error is 9.2%. It should be emphasized that the 
problem of labeling cluster #1 from AMOEBA is also serious, since cluster 1 is 
centered near the means of the spurious patches used to label class 1. 

The thresholding of boundary outliers makes a pronounced difference in the 



estimate. The small grains acreage estimate derived from HISSE without 
thresholding would be 19,230, comparable to the estimate of 20,336 derived 
from amoeba's cluster map. 

5. Conclusions . 

The accuracy with which HISSE estimated the small grains acreage in 
segment 1618 was impressive, to say the least, but of course the procedure 
must be tested on other segments for which ground truth is available. Also, 
as we mentioned in Section 4, the accuracy of the estimate depends on the 
classification given to the labeling fields for class 1, the problem class. 

The procedure we used-dividing the class evenly among competing ground truth 
labels - seems fair; however, in an operational situation the class would be 
labeled by an analyst looking at a film product and it seems unlikely that 
he would apportion the class in such a way. In any case, the greatest possible 
relative error was 15%, still a marked improvement over the accuracy obtained 
by labeling AMOEBA'S clusters and counting the cluster assignments, or that 
achieved by HISSE without the thresholding procedure. 

The performance of HISSE, or AMOEBA, depends in large part upon the purity 
with respect to ground truth labels of the patches found by AMOEBA, which is 
influenced by the user defined "percent in fields" parameter in AMOEBA. In this 
experiment we defined the parameter as 50%; that is, we conservatively estimate 
that 50% of the pixels in the scene should be found in fields. By reducing the 
size of this parameter, we expect to produce a higher degree of patch purity 
and thus alleviate the problem of having a class represented by labeling patches 
which should not be patches at all. We hope that this will not aggravate another 



problem, namely that the ground truth map for segment 1618 shows a few large 
fields representing important classes (such as barley) in which no patches 
were found. 

Finally, we note that although the aggregated small grains acreage was 
very accurately estimated, the individual estimates for the various small grains 
classes (spring wheat, barley, oats, and millet) were not nearly as accurate. 
Indeed, several of the HISSE classes could not be given a single one of these 
labels, although they clearly represented small grains. Moreover, there was 
one significant crop class (beans) without a small grains label which was 
seriously underestimated. Thus, the usefulness of HISSE in a multicrop inventory 
'cannot yet be determined. 
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CLUSTER a 




CLUSTER MEAN 




PATCH PROPORTION 

1 

26.04 

110.39 

29.79 

121.70 

36. 

49 

IP. 

02 

26.44 

108.04 

.077 

2 

24.99 

108.48 

28.17 

117.42 

44. 

25 

115. 

57 

34.05 

112.63 

.094 

3 

24.80 

106.86 

28.82 

111.90 

32. 

59 

111. 

73 

21.69 

107.00 

.271 

4 

25.51 

111.64 

30.29 

127.63 

50. 

08 

115. 

15 

39.10 

113.13 

.094 

5 

25.46 

108.75 

29.26 

122.53 

48. 

90 

114. 

94 

36.61 

111.77 

.100 

6 

25.09 

109.24 

29.35 

123.39 

48. 

80 

114. 

94 

18.15 

103.83 

.158 

7 

23.90 

106.14 

28.76 

113.53 

38. 

15 

113. 

07 

37.15 

112.73 

.058 

a 

25.05 

112.20 

33.45 

135.38 

56. 

52 

116. 

32 

17.19 

102.97 

.026 

9 

23.26 

105.98 

29.02 

108.48 

34. 

30 

125. 

54 

25.91 

121.94 

.048 

10 

25.50 

107.50 

35.75 

123.25 

37. 

25 

126. 

50 

20.25 

104.75 

.003 

11 

25.49 

110.83 

30.71 

128.90 

24. 

92 

104. 

16 

19.04 

104.01 

.045 

12 

37.60 

123.64 

37.76 

123.44 

31. 

92 

116. 

60 

25.48 

118.12 

.010 

13 

30.16 

132.47 

31.80 

139.64 

27. 

37 

123. 

07 

20.68 

123.83 

.016 


CLUSTER VARIANCE 


1 

7.98 

10.82 

3.22 

36.25 

51. 

31 

16.82 

12.68 

10. 

60 

2 

6.09 

10.51 

3.25 

25.33 

33. 

50 

8.50 

23.18 

18. 

36 

3 

7.87 

5.24 

7.29 

32.49 

29. 

08 

18.48 

17.25 

12. 

48 

4 

4.54 

18.49 

2.48 

15.77 

32. 

80 

7.96 

16.41 

5. 

97 

5 

9.11 

4.70 

3.13 

21.46 

27. 

59 

6.43 

19.92 

G. 

90 

6 

4.64 

8.34 

4.26 

38.13 

44. 

59 

6.00 

11.1? 

6. 

22 

7 

4.74 

2.60 

6.14 

22.62 

15. 

73 

11.2? 

17.19 

7. 

90 

8 

1.50 

3.18 

3.61 

12.71 

15. 

00 

1.84 

3.43 

1. 

59 

9 

2.90 

3.42 

5.40 

11.30 

11. 

44 

24.02 

8.12 

53. 

75 

10 

4.25 

0.25 

0.69 

35.19 

11. 

19 

4.25 

1.19 

3. 

69 

11 

4.00 

5.83 

5.35 

33.79 

5. 

26 

1.55 

8.07 

3. 

38 

12 

3.28 

2.56 

2.90 

3.69 

1. 

43 

1.61 

3.93 

3. 

95 

13 

1.75 

9.97 

1.38 

5.20 

1. 

31 

2.81 

1.09 
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FINAL CLASS STATISTICS (HISSE) 


CLASS # 







CLASS MEAN 






PATCH probability 

1 

26 . 

91 

109 

.19 

29 

.64 

117 . 

57 

35 . 

07 no . 

,50 

25 . 

53 

107 , 

,45 

.126 

2 

24 . 

,62 

108 

.52 

27 

.91 

117 . 

84 

44 . 

68 115 . 

,93 

35 . 

13 

113 . 

,58 

.083 

3 

24 . 

,11 

106 

.34 

28 

.61 

no . 

87 

33 . 

73 113 . 

,30 

21 . 

65 

107 . 

,51 

.221 

4 

25 . 

,58 

111 

.88 

30 

.23 

126 . 

89 

50 . 

83 115 . 

,51 

39 . 

97 

113 . 

.64 

.084 

5 

25 . 

,30 

108 

.73 

29 

.41 

123 . 

19 

48 . 

09 114 . 

,35 

35 . 

83 

Ill, 

,28 

.108 

6 

25 . 

.10 

109 

.25 

29 

.36 

123 . 

38 

48 . 

73 114 . 

,95 

18 . 

20 

103 , 

.89 

.170 

7 

23 . 

,89 

106 

.13 

28 

.78 

113 . 

49 

38 . 

08 113 . 

,06 

37 . 

04 

112 . 

.70 

.061 

8 

25 , 

,06 

112 

.25 

33 

.47 

135 . 

41 

56 . 

65 116 , 

.35 

17 . 

13 

102 , 

.93 

.023 

9 

23 . 

,26 

105 

.98 

29 

.02 

108 . 

48 

34 . 

30 125 . 

.55 

25 . 

91 

121 . 

.94 

.048 

10 

25 , 

,50 

107 

.50 

35 

.75 

123 . 

25 

37 . 

25 126 . 

.50 

20 . 

25 

104 , 

.75 

.003 

11 

25 , 

,25 

110 

.37 

29 

.80 

127 . 

20 

24 . 

86 104 . 

.14 

19 . 

07 

103 . 

,99 

.048 

12 

37 , 

,60 

123 

.64 

37 

.76 

123 . 

44 

31 . 

92 116 . 

.60 

25 . 

48 

118 , 

,12 

.010 

13 

30 , 

,16 

132 

.47 

31 

.80 

139 . 

64 

27 . 

37 123 . 

,07 

20 . 

68 

123 , 

.83 

.016 








CLASS VARIANCE 







1 

9 , 

,56 

10 . 

44 

5 . 

08 

51.15 

72 

.18 

24.81 

44 . 

.57 

12 , 

.14 



2 

3 , 

,76 

10 . 

02 

2 . 

71 

23.47 

35 

.32 

8.02 

15 , 

.39 

17 , 

.05 



3 

4 , 

.66 

3 . 

29 

6 . 

93 

25.02 

21 

.94 

9.55 

14 , 

.74 

13 , 

.05 



4 

4 , 

,78 

20 . 

68 

2 . 

74 

19.15 

39 

.22 

7.15 

16 , 

,30 

4, 

,31 



5 

9 , 

,48 

4 . 

02 

2 . 

98 

26.54 

20 

.81 

4.94 

18 , 

,06 

6 , 

,76 



6 

4 , 

,60 

8 . 

04 

4 . 

29 

38 42 

44 

64 

5.61 

11 , 

.24 

6 , 

.09 



7 

4 , 

.66 

2 . 

34 

6 . 

15 

22.65 

15 

.92 

11.02 

37 , 

.65 

7 , 

.82 



8 

1 , 

.53 

3 . 

19 

3 . 

62 

12.65 

14 

.57 

1.81 

3 . 

.33 

1 . 

.50 



9 

2 , 

,89 

3 . 

24 

5 . 

36 

11.27 

11 

.47 

23.77 

8 , 

.18 
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CLASS ACREAGE ESTIMATES 


CLASS 

ACREAGE 

CROP LABEL 

1 

3764 

Small Grains 
Beans 

Idle Fallow 

2 

1550 

Small Grains 

3 

3560 

Spring Wheat 

4 

1237 

Spring Wheat 

5 

2253 

Small Grains 

6 

3257 

Small Grains 

7 

1218 

Small Grains 

8 

262 

Spring Wheat 

9 

917 

Idle Cover Crop 

10 

4 

Flax 

11 

697 

Barley 

12 

49 

Homestead 

13 

171 

Trees 

C 

6124 

Contaminated Data 
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1, Introduction 

This paper is concerned with the existence, uniqueness, and asymptotic 
properties of a strongly consistent local maximizer of the likelihood 
function for a vector parameter in the case of nonidentically distributed 
samples and without prior assumptions which insure the existence of a global 
MLE. Well known results pertaining to scalar parameters and i.i.d. samples 
date back to theorems of Cramer C 5] and Huzurbazar [11], while results 
concerning the consistency of the MLE, under assumptions that insure a 
unique MLE, may be found in Wald [17], Wolfowitz [19], and LeCam [12], 

Somev/hat more recently, Si Ivey [15] has dealt with the asymptotic properties 
of the MLE without independence. Surprisingly however, a correct proof of 
the multidimensional version of the combined results of Cramer and Huzurbazar 
on the existence of a unique consistent solution of the likelihood equations 
when multiple roots occur did not appear until 1977 in a note by Foutz [10], 

(see also Tarone & Gruenhage [16], Chanda [ 3], and Peters and Walker [14, Appendix] . ) 
Examples 1 and 2 which follow illustrate the need for a consistency theorem 
along these lines which relaxes the assumption of identically distributed 
observations. 

Example 1 (Observations with missing components): Let Xp X 2 , ... be 

independent random vectors in v/hose common density is one of a parametric 
family (o(x|0)}g^g , where 0 is a subset of r'^. Suppose that instead of the 
X^. we observe only certain subvectors B^Xp 8 ^X 2 , ...» where (B^. } is a given 
sequence of n^ x n matrices obtained by deleting n - n^ rows from the identity. 
Clearly we can assume that components are missing at random provided that the 
B/s are independent of the X^. 's. Under what conditions is there a unique 
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strongly consistent (and asymptotically efficient) local MLE of 0 based on the 
observations B2X2, ...? 

A recent paper by Dahiya and Korwar [61 illustrates that even for a bivariate 
normal sample, with several simplifying restrictions on the sample and on the 
parameters, the likelihood equation for Example 1 has multiple roots and requires 
numerical methods for its solution. 

Example 2 (Estimating mixture density parameters with sample blocks of varying 

sizes): Let f(x[Tj), f(x|T 2 )» •••» unknown but distinct members of 

a multivariate parametric family {f(xlx)}^^-]. , and let Oj, ..., be the unknown 

positive probabilities corresponding to a discrete mixing distribution supported 

on [xp ..., x^^}. The number m is known. Under what conditions will there be 

a unique consistent MLE of the parameter 0 = (op ..., ct^.p Tp ..., x^^) 

m 

describing the mixture density q(x|9) = E a.f(xjx.), based on a sample of the 

i=l ^ ^ 

type Xp X2, ...» where the X^ are independent and each X^ is itself a random 
sample X^. = (X^-p .... of known size from an unknown component density 

f(x|x^.)? In this example the parameter 0 is only locally idemtifiable. Moreover, 
it can easily occur that the likelihood function is unbounded [91; hence, the 
need for a consistency theorem for local maximizers is especially clear. 

The practical importance of Example 2 is indicated by the fact that 
estimation of mixture density parameters is often proposed as an alternative to 
the clustering of large amounts of multivariate data [181. The asymptotic 
properties of the MLE are of interest because of the prevalence of large sample 
considerations in judging cluster validity [81, even though it may be difficult 
to argue for a statistical basis for a given clustering problem. The presentation 
of the data in blocks of varying size may occur when the primary sampling units 
are grouped by physical or spatial associations (see [21 and [131 for an 
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application of this idea in the analysis of pictorial data.) 

Finally we remark that the existence and uniqueness of a consistent solution 

of the likelihood equations bears on the numerical problem of obtaining the 

estimate. Each of Examples 1 and 2 is a missing data problem (in Example 2 

the random variables which indicate the component population of origin are missing); 

thus, a natural numerical procedure for obtaining a MLE is one derived from the 

generalized EM procedure of Dempster, Laird, and Rubin [7]. Such a procedure 

increases the value of the likelihood at each iterative step; however, this is 

no guarantee of convergence, since the likelihood function may be unbounded. 

Generally speaking it is possible to show that the Hessian of the log likelihood 

is negative definite near the consistent solution of the likelihood equations. 

Thus, the generalized EM procedure is convergent to it given a good enough starting 

value (see [14] for a thorough discussion of numerical properties in the case 

of a mixture of multivariate normal distributions.) 

Throughout this paper the symbol Eg will denote expectation with respect to 

2 

a distribution determined by a parameter 6 and D , D etc. will denote differen- 

Vi U I V 

tiation or partial differentiation with respect to scalar or vector variables u, v. 

For a scalar valued function, will denote the gradient with respect to an inner 

product which will usually be understood from the context. Given an inner product 

k 

<-|-> and a vector a, the symmetric k-1 inear form f(rii . •••> Hi,) = H <o|q.> will 

k ' '' j=i ’ 

be denoted by <a]’> . Thus, for example, we may write the covariance of a statistic 

O 

S as Cov^(S) = E^{<S - E^(S)|-> }. The largest and smallest eigenvalues of a 
symmetric positive definite operator A will be denoted respectively by p(A) and 
o(A). 
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2. A General Consistency Theorem. Let 0 be an open subset of and for each 

positive integer r and each 0 e 0, let q^(*|6) be an fl^-variate density with 

respect to sone fixed o-finite measure on Let 0° e 0 and let 

X , ... be a sequence of independent random vectors with X having density q„(-|6°)- 
p r r 

For 0 e 0 define 

P 

L (e) = Z log qp (X |0) 

H r=i 


Theorem 1 : Suppose 

(i) Dq % dX^(x) = 0 , 

<”> (x|0°) dX^(x) = 0 , 


and that there is a constant M, functions f^, a 
sets A^ in R^*>" such that for all r, 0 e il,x <' A 

0 ., 0 

i j k 


neighborhood of 0° and X^-null 
r’ 

i» j> k — 1> •••» V 


(iv) 


E0o{fr(X^)^} ^ 


(v) 

(vi) 


Ego{[Dg_ log q^ (X^|0°)]^} < M 




1 


t Di „ q, (X,ie°)]^) s M 


q,(X,|0°)2 ‘ 


1 = 1 , .... V 

i, j = 1, .... V 


and 

(vii) 


there exists e 


1 n 

> 0 such that — Z J^(0 ) ^ 
Pr=i ^ 


for sufficiently large p, 


where J^(0°) = Eqo(Vq log q^ (X^|0°) vj log q^ (Xj.|0°)}, is the identity on 
R^, and the ordering is the usual one on symmetric operators. Then there is a 
neighborhood f2° of 0° such that with probability 1 there is an integer Pj such 
that for p ^ P2 there is a unique solution 0^ in of the likelihood equation 
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D_L (6) = 0. Furthermore, 9*^ -»■ 0° as p ^ and 6^ is a naximum likelihood 
0 P 

estimate. The consistent estimator 0*^ is asymptotically normal and asymptotically 
efficient. 

Proof : In the proof we make repeated use of the following version of the strong 

law [4, p. 103]: let Z^, ... be uncorrelated random variables such that 

in, 

the variances of the Z. are bounded. Then — Z \1 - E[Z ]) 0 a.s. as n -> <». 

1 "j=l ^ J 

1 P 

Let Sp(0) = p Dglog q^(X^|0). By (i) EQo{Sp(0°)} = 0 and by 

r~ 1 .1 

(v) S p(0°) -»■ 0 a.s. as p ->■ <». Consider the vxv matrix DgSp{0°) whose i, 

element is 


\ 

P 


P 

Z D 
r=l 


0i,0. 


log q^(X^|9°) 


i ^ 

P r=l q^(X^|0° 



q,(xj6°) 


- ^ E D log q (X |9°)D log qJX |0°). 
p r=i «i ^ r Oj ^ 

By (li) the expected value of the first term on the right is zero. Hence, by 
(v) and (vi) 

Vp< 0 °> " f * 0 


a.s. as p -»■ Thus, with probability 1, if 0 < n < ^/2 there is p^ e N 
so that for p > p 

DgSp(0°) < -2nl . 

Without loss of generality we can assume is convex. For 0 e n. 


i J |o^ „ log q.(xje) - d2__„ log q^(xje°)| 


p i“0. ,0. 'ir'''r'''' ‘"0. ,0 

^ r=l 1 ’ j 1 ’ j 

P ^ n 1 

■\0 1 z*! I 


= P ?, - «kl 4 l“o..6..0 

1 K X 1 J K 


sis S - Oklyx^) 


r=l k=l 
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With probability 1, for large p 

P 

E 

r=l 

< 1 + . 

It follows that for any particular norms on R'' and on the symmetric vxv matrices 
there is a constant M such with probability 1 there is a positive integer p^^ 
such that for p > Pj, 0 e 

l|D0Sp(e) - DQSp(e°)|| < M||e - e°l| . 

Thus there is a convex neighborhood of such that 

DeSp(0) . - ni 

for all 0 € p > Pj. It now follows that for p s Pj Sp is one to one on 

and that the image under Sp of the sphere Q^(Q°) at 0° of small radius 6 

contains the sphere Sp{0°) of radius n6. Since 0 is eventually 

in ^^5(Sp{0°)) there is a unique solution of D0Sp(0) = 0 in fig(0°). Since 

D0Sp(0) is negative definite, this solution is a MLE. 

Let The Cram6r-Rao lower bound for p observations is 

n n r ' ' 


P P,=i r 


.-1 


verified without difficulty to be (p ^p)~ • By (v), (vii), and Liapounovs 
Theorem [4, p. 200], p^ asymptotically distributed as N (0, I). 

Moreover, in a neighborhood of 0° we may write 

Sp(0) = Sp(0°) + A(9)(0 - 0°) 

where A(0) -> DgSp(0°) as 9 0°. It follows that with probability 1. 

p^" Ep" (0P - 0°) = - Ep" A(0P)"^ E^ p^ Ep** Sp(0°) 

for large p. Since DgSp(0°) + I^p 0 and A(0P) ^Q^p(0°) with probability 1, 
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I' D -1 ^ 

the expression -Ep A(0^) Ep converges almost surely to the identity. Therefore, 

Ep (0*^- 0°) is asymptotically N^{0,I) and 0*^ is asymptotically efficient. 

This concludes the proof, 

3, Appl i cations. 

Suppose that in Example 1 the have a common n variate normal distribution 
M^(y, E) and it is desired to estimate p, E by maximum likelihood based on the 
observed components B^X^, B 2 X 2 , ...» likelihood equations for p and 

E are 

(3.1) E B^(B EbL'^ B y = E b[{Q ZbI)~\x . 

r=l r r r r r r r r 

and 

(3.2) ^^b]!(B^I:b];)'^B^ = E^sJ(B^EbJ)"^B^(X^ - y)(X^ - y)^ . 

and have no explicit solution, although for given E (3.1) may be solved explicitly 
for y provided that the matrix an the left of (3.2) is invertible. 

Components i and j are paired in the observation B^X^ if both the i^” and 
columns of B^ contain a 1. Let (J>(i, j, p) denote the relative frequency 
with which the i^^ and components are paired in the first p observations 
B,X,, ... , B X , and let 4),(i, j) = lim (j)(i, j, p) . 

^ p-).oo 

Theorem 2 : Let Xj, X 2 , ... be independent, identically distributed according 

to ^^(y, E). If c|)j(i, j) > 0 for all i, j = 1, ..., n, then there is a unique 
strongly consistent solution of the likelihood equations (3.1) and (3.2), which 
has the asymptotic properties given in Theorem 1. 

Proof : The only one of conditions (i) - (vii) in Theorem 1 which poses any 
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difficulty is number (vii). For 6 = (p, Z), the information matrix J^(6) 
corresponding to the density of S^X^, 


q (-le) = (B u. Q ZbI) , 


IS 

(3.3) 


J,(e) = 


u^(e) 

0 

0 

u^(0) ^ u^(e) 


where U^(9) = , and the Kronecker product U^(e) ® U^{e) 

represents the symnetric operator on n x n real symmetric matrices S (with 
trace inner product) defined by U^(0)SU^(0) . Thus (vii) is satisfied if for 
each Z there exists e = e(Z)>0 such that for all p sufficiently large 

(3.4) 


i z z'^bJ(b zbJ)'^b Z > clh 
Py,= l r r r r 


and 

(3.5) ' ^ Z TrCBj(8^ZBp'\s.l^ > eTrS^ 

for all Z e and symmetric S. However, (3.5) implies (3.4), as can be seen 
by taking S = ZZ^. Hence, it suffices to establish (3.5) under the stated 
hypotheses. 

TrCBj;(B^ZBj;)’^B^s/ 

= TrC(B ZBI)-^(3’'SBI)1^ 


Now, 


r r' 


9 


But, 


= TrC(B^ZBT)-^(B^SBp(B^ZBT)"^] 

^ aC(B^ZBp"^^0 (B^ZB];)-’'^] TrCB^SB^]^ 


® (B^ZB];)'^^3=l/p_aB^ZB];)*'^ ® (B^ZbT)'^] 


and 



9 


(B = sup Tr(B ZBb’^ A (B ZB^) A (B ZB^)^^ 

r r r r r r r r r r 

= sup Tr[(BZBbA]^ 

TrA^<l 

= sup TrZB^AB ZB^AB^ 

TrA2<i r r r r 

= sup TrCzVAS Z^^]^ 

TrA="<l ^ ^ 

< pCZ^"0 zS sup TrCB^AB^]^ 

TrA^'si 

= pcz’^o z^] . 

The last equation follows from 3 B^ = In • Hence, 

r r '•r 

TrCBj^(B^ZBj)'^^S]^ > o[Z"'"^ 0 z"’^"1 TrCB^SB^^/ 

= aCZ'^" 0 Z"'^3 TrCBj^B^SBj^B^l^ . 

Therefore, 


^ Z Jr[Bj(B^ZBj)"^B^S]^ > afZ’*'" 0 Z"*'"] ' ^ Z ^rCB^B^SB^B^]^ 




•1 


> aCZ'^0Z'=']aCi& (b;,B^) 0 (BX)3TrS 


V=1 


r r' 


Since eventually 

a[i Z (bIbJ 0 (B^B^)] > |-nin (j>,(i,j) . 

Pr=l ^ i,j 

(vii) follows upon taking c = ^ min 4)i(i,j) • pCZ^0Z*l ’ QED. 

i.J 

The second application of Theorem 1 is to the problem outlined in Example 
2. We assume that the unknown component densities f(x|T^) are from a regular 
exponential family (see [11 for definitions) with minimal canonical representation 
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(3.6) f(xlx) = C(t) exp <t|F(x)> (t e T) 

with respect to a a-finite measure X, where T is an open subset of a finite 
dimensional space V with inner product <-|->. Vie also assume that for distinct 
^1’ functions together with any 

components of F(x)e*^^* I F(x)e‘^^'^l are linearly independent 
[X]. The joint density of = (X^j, given that X^ is a sample 

from f(x|i£) is 

(3.7) " Y^(Tj^)exp<xj^| G^(x^)> 

where x^ = (x^p .... x^j,j^) 

Yr(Tj) = C(tj)"i- 
and 

Hr 

GJx ) = E F(x ) . 

r r rj 

The log- likelihood for the parameter 0 = (ap a^_j» Tp .... x^^) of 


Example 2, based on the sample Xp 


X 


P 


(3.8) 


Lp(9) 


= y- log q^(X^|e) 
r=l 


is 


where 

(3.9) 


m 

q^(X^|9) = Z 

j6— 1 


and p (X Ixn) is given by (3.7). The following lemma collects some facts 
r r ^ 

about exponential families which we require. For proofs, see Barndorff- 
Nielsen Cl] . 
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Lema 1 : Let (1) be a canonical representation of an exponential family. 

For T e T let k(t) = - In.C(T) = In / exp<r | F(x)>dX(x) 

R" 

Then 

(i) For each t e T, F(x) has moments of all orders with respect to 
f(x|T); 

(ii) k(t) has derivatives of all orders which may be obtained by 

differentiating under the integral sign. D^k(t) may conveniently 

be represented as a synmetric k-1 inear form on V whose coefficients 

are polynomials in the first k moments of F. In particular, 

(iii) D^k(t) =<E (F)j-> = /<F(x)|*>f(xlT)dX(x) 

T T rP 

and 

(iv) D^k(t) = cov (F) = /<F - E (F)|->^f(x|T)dX(x) ; D^<(t) is 

I I I ^ 

positive definite. 

(v) <(t) is strictly convex on T. 

We are now ready to establish consistency of the MLE in Example 2. 

Theorem 3 : If the numbers { are bounded and Lp(0) is given by (3.8) 

then with probability 1 there is a unique consistent solution of DQLp(9) = 0 
which, moreover, is a MLE of the parameter 0° = (a^, ..., ..., x°) 

and is asymptotically normal and efficient. 

Proof: Write u„(t.) = E (G ) ; u(xj = E (F). Using Lemma 1, the nonzero 

r X, ^ 

derivatives of qj,(x^|9) up to order 2 are: 

(3.10) ‘ Pr^V^'^m^ ’ 1 - ^ 

D^^q^(x^|0) = a£Pr(vl'^£^^r^’^r^ " ’ 


(3.11) 


1 < ii, < m 
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(3.12) 

(3.13) 

(3.14) 


’ 1 ^ ^ ^ n-1 


V a 


T^,a^ ’ 1 ^ ^ ^ >^-l 


q^(x^|e) = ’ 1 < £ ^ m . 


Conditions (i) and (ii) of Theorem 1 follow immediately from (3.10) - (3.14). 
Similarly, using Lemma 1 and the boundedness of {M^}, conditions (iii) - (vi) 
of Theorem 1 are readily verified. It remain to verify (vii). We may write 
Jp('l^) in matrix form as 


J^(0) = 


0 N'=*4 

r 2 


A 

r r 


* 

B C 
r r- 


0 


where I, and K are respectively the identity operators on and V*” and 




'r Pp(yltt) - P,(X,|t^)1Cp,(X^|t^) - p^(x^|tJ 1 


a, k = 1, .... ni-1 


B = 
r 


Cr = 


'VrfVl^k’l^Pr'VlV - Pr<VlV^ ..-h 

; — j — — — N <S - p_(T|^) 

qr(X^|0)^ r r r k 


'“t“k'’r<''rl^t>Pr<’'rl^k) .,-1 


q,(X^|e)‘ 


'‘r - Pr<^k)>"^ - Pr'V 


Z = 1 m-1 

k 1, ..., m 

k, 5, = 1, . . . , m. 


The assumptions concerning the linear dependence of the functions exp<TlF(x)> 
and F(x)expi T I F(x)> insure that J^(0) is positive definite for each r. 
Condition (vii) will be established once it is shown that the smallest 
eigenvalue of J^(0) is bounded away frojn zero as ->^ «>. 
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Clearly, 


o(J^(0)) > a Eq 


A 

r 

* 


r J 


Observe that 


Pp ( XrKp ) 1 

= exp - x(t^) - <Tj . G^>]} . 


Pr<VI''k> 

If is a sample from f(x|T|^), then the expression in square brackets 
converges to 

k(Tj^) - k(tj^) -<t^ - = k(tj^) - k(t|^) - • (Tj^ 

which is positive by the strict convexity of ic. Hence, 

Pr ( X ^ lT ^) 


MV(v ^ 


Therefore, 


p 

Pr<Vl’’«>Pr<Vl''k' 

- F 

Pr(XrKn) 

^8 

qr(X^|0)^ 

^k 

qr(Xrle) 


converges to 0 if £ k and ^ if £ = k as H^ «>. Thus, 

k 






1 . ^ 
2 

a a. 
m k 

-1 


as N. 


Given that X is from f ( x | T . ), N ^{G - y ^( x . )) converges in distribution 

r K f r f K 

to a normal random variable Z with mean zero and covariance cov (F). 

^k 

Hence, 

-i' 
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converges in distribution to 0 if £ k and — Z if £ = k. 

“k 


Let A be any element of V and consider 

- u^(t,^)|A>]'^ = rr^C J'<F(X^j) - (F)|A>]'^ 

J ^ k 

After expanding and taking expectation with respect to it will be seen 
that the only nonvamshing terns are those of the form 

[<F(X .) - E (F)|A>^<F(X ) - E (F)1a>^] 

Tk n Tk r£ Tk 

of which there are ^ (2*^) “ 


is bounded as It follows from a standard theorem on convergence of 

moments C4, p. 95] that 


E 

T 


k 





<®r - 


-»■ 0 as N 

r 


GO 


Thus Eq(B^) ^ 0. Similar reasoning shows that 


as ^ °o. Therefore a(J^(9)) is bounded away from 0 and this concludes 
the proof. 


4. Concluding Remarks . 

Theorem 3 remains true under weake assumptions then the boundedness 
of the sample sizes but nothing like the approach embodied in Theorem 
1 will work without some restrictions on N^. Nevertheless, it is far from 
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intuitively clear that restrictions are needed for the existence of a 
consistent MLE. Similarly, it seems plausible that the assumption in 
Theorem 2 that components be paired with nonzero asymptotic frequency 
might also be weakened. In certain cases, e.g., when a normal mean is 
to be estimated from data with missing components and the covariance is 
the identity, the existence of a consistent MLE with desirable asymptotic 
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SUMMARY 


In this paper we investigate the problem of estimating the 
parameters for a mixture of densities from, possibly distinct, 
exponential families. The likelihood equations used by Hasselblad 
(1969) are necessary conditions for a local maximum of the likeli- 
hood function. We show that a particular repeated substitution 
scheme, determined by the likelihood equations, converges locally 
to the strongly consistent maximum likelihood estimate. This 
generalizes the results of Peters and Walker (1978). 


Some key words: exponential families, maximum likelihood estimate, 

mixture densities. 
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1 . Introduction 

Let X be an n-dimensional random variable whose density p (with 
respect to some a-finite measure) is a convex combination of densities 
p^ , where each p^ belongs to some exponential family, i.e., 

m 

p(x) Pi(x) 

a° > 0 I aV = 1 
i i = l ^ 

P^-(x) = h^.(x) exp < q°, 


and where <', ’ > is an inner product on R '* defined by <x, y> = 
x"" Z. y. 

N n 

If {x. } IS an independent sample on R then a maximum likelihood 

k=J ^ 

estimate of {a° , q° } is a choice of parameters {a^ , which locally 

maximizes 


IN , , 

L = - I log p(xj 

N k=l 


with {a,, q,}^ replacing {a°, q°}"’ in the evaluation of p. 
^ ^ i=l ^ ^ i=l 



If we assume that this choice is to be made from some open neighbor- 



then a necessary condition for a local maximum is that 


' N p(x^) 





N j 

k=l p(X|^) / 


1 

N 


N 

1 

k=l 


P,(X|,) 


where 6^ = (f^). 

Equations of this type will be referred to as likelihood equations 
and these were introduced by Hasselblad (1969) for the case that each p^ 

belonged to the same exponential family. We will see that this restriction 
IS not essential. The case that each p^ is a multivariate normal density 
has a longer history and has been considered by Day (1969), Duda and Hart 
(1973), Peters and Walker (1978), Wolfe (1970), and others. All of these 
authors considered a particular repeated substitution scheme to itera- 
tively solve the likelihood equations. 


2. Assumptions and a change of parameters. 

At this time it is necessary to change the way each family is 
parameterized. The following lemma will provide some insight into this 
change. The lemna is essentially a rearrangement of some ideas pre- 
sented in Berk (1972) and Barndorff-Niel sen (1978) and is outlined 
below. Throughout this paper "V" will denote the Frechet derivative of 
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a vector valued function of a vector variable. For questions con- 
cerning Frechet derivatives, see Luenberger (1969). 

Lemma 1 Let p^(x,q) = r(q)h(x) exp "For qeJl^ an open subset of 

If p^lx.q) = PQ(x,q) a.s. implies that q = q, then 6(q) = E (f) is 
a 1-1 function. We also have that is an open subset of R ° and 

q(0) is a continuously differentiable function with VqP nonsingular. 

Proof In Chapter 8 of Barndorff-Nielsen (1978) we have that £(q) 
is 1-1 and infinitely differentiable Since 6(q) is continuous, it follov;s 

from the Brouwer invariance of domain theorem see Dugunji page 358 (1966)) 
that 0(fiQ) is open. We also have that 

Vq6 = VqEg(f ) = { J (f- 6 )(f- 6 )^g} E ^ . 

Since e(n) is open and E (f) = 6 it follows that V e is nonsingular. 

q q 

The final conclusion of the lemma follows from the inverse function 
theorem. 

Throughout the rest of this paper we will make the following 
assumpti ons. 

n . 

1) P^(x,q^) IS defined for each q^ e an open subset of R ^ con- 
taining q° and q is uniquely determined by p^- (x, q^ ) . 

^ t ^ 

2 ) If S is a proper subspace ofR, t=m+En , then 

i = l ^ 
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where the probability and functional evaluation are taken with respect 

to {a° e®}"" . 

^ ^ i=l 


We note that this assumption is a generalization of identifi- 
ability (see Yakowitz and Spragins (1968) and Teicher (1963)). That 
this is a nontrivial change can be seen in the following example. 

Example Let p-j(x) = le”^^ and P£(x) = t^xe"^^. Clearly p., and 
p^ are identifiable. We now observe that 


P] ( ^1 * 6*1 ) “ P-j ( ^ ) 


and so 


P,(f,- 0,) + Ipj - jp, ' 0. 

By defining 0^ = (f^) and using lemma 1 we can proceed to the new 

parameterization of p^ , i .e , 

P^(x,e^) = r^(6^)h^(x)exp <q^(e^), f^(x)>.. 


This change in parameters does not change the necessary conditions 
for a local maximum of L. 



We now consider a statistical property of solutions to the 
likelihood equations. The following lemma is a consequence of the fact 
that the conditions of Chanda (1954) are satisfied by p(x) and is 
offered without proof. The reader is referred to Peters and Walker 
(1978) for further discussion. 

Lemma 2 Given any sufficiently small neighborhood of the true 
parameters, with probability one as N approaches infinity, there is a 
unique solution to the likelihood equations in that neighborhood 
and this solution is a maximum likelihood estimate. 

This solution is called the strongly consistent maximum likeli- 
hood estimate. 

3. THE GENERAL ITERATIVE PROCEDURE 

A natural iterative procedure for solving the likelihood equa- 
tions is suggested by their fixed point form. We generate a sequence 
of estimates by repeatedly substituting the last estimate into the 
right hand side of the likelihood equations. This generates a new 
estimate. Hasselblad (1969) and Day (1969) have shown many examples 
where this work. Peters and Walker (1978) have proven that if each 
p^ is a multivariate normal density, then this procedure converges 
locally to the strongly consistent maximum likelihood estimate. Our 
proof of the local convergence for exponential families generalizes 
this result and the proof is patterned after their argument. Before 
we proceed further it will be helpful to introduce some notation 
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M 

Since 6^ ranges over an open subset of R \ the natural 


parameter space is a subset of 


^ _ n, n 

R^ = R^©R'e ...©r"’ 


where t = m + Z n . We then have that 
i = l ^ 


IS an element of R . If for i=l,..., m we let 

A,(y) ‘ - I 
' N k=1 » 

„ r 1 - 1'? !i!l Ay !l 

’ " k=l P / N k-1 P ’ 


then the liicelihood equations become 


-/A(y)\ 

^M(rV 


where A 


and M = 
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Equivalent to equation 2 is 

Y = = 0-e) He . 

We define the repeated substitutions scheme by 

i+1 * / i\ 

T = $£. (y ) . 

The operator is said to be locally contractive near a point 
if for some norm | | • | ] on there is a number 0 £ X < 1 such that 

1 1 (y") - yM 1 X 1 1 y' - X 1 1 

whenever y is sufficiently close to y* 

4. LOCAL CONTRACTABILITY 
We will now establish the following theorem. 

Theorem 1 . With probability one as N approaches infinity, is a 
locally contractive mapping (in some norm) about the strongly consist- 
ent maximum likelihood estimate whever o < c < 2. 

Proof. For any norm on R^ one can write 

(y') - y = (y) [y"~ y] + 0 1y -y' 1 1 

where y is a solution to the likelihood equations. We can see that 
the theorem will be proved if one can show that with probability one, 
V4>^ converges to an operator which has norm less than one. 
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We can write V4>^(y) as a matrix of prechet derivatives 


V$^(y) = (1-c) I + e 




We recall that 6^ is nonsingular and since 

_x 

we have that Vq q. is positive definite with respect to the usual inner 
product on R \ So we define for i=l m hy 


<x, y y 


and let b^. = p^/p. 

By direct calculation, using the likelihood equations, we see 
that if y is the strongly consistent maximum likelihood estimate then 



T 


1 ^ 

V^A(y) = I - (die^g a^) i i 


b,(x^) 


k'l \ b^(xk) 



b,(X|^) 


V*k> 


1 N 
VeA(v) • - i I 
k= 1 


bi(xj^)\ /< b^(x,^){f^ (X|^) - 6^}, *>i 

KK^' X‘'m<''k>'V’'k' - V- 


I 


N 


b,(x^Hf,(X|^) - e,} \ /b,(X|^) 


V M(y) = - diag a. ^ i 
a IN 



■ «n.l ' 


1 ^ 

V,M(y) = (diag I v,^p,(x^) 


N k=l 


b,(X|^Uyx^) - e,)\ /<b,(x^)(f,(X|,) - e,), ■>, 



- 6„)/ Vb„(x^)ffjx,) - e^t, -y 


We observe that (y) can be written as 

e 


(y) = fT 2 y) 

k=l 


where V^F(x, y) exists and has the property that for any norm ||- || on 
V^F(x, Y ) there exists a real valued function g such that 
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1 1 •"(><. y) II 1 g(x) 


and 


J*g(x) p(x, Y°) < 


00 


for every y in some neighborhood of y°* It follows from this that 
evaluated at the maximum likelihood estimate converges to 
E{V $ (y°)}. Hence it will suffice to show that in some norm M-ll, 

Y £ 

E{V ?> (y°)} has norm less than one. 

Y £ 


Let 


/ (x) 


\ 


V(x) = 


b (x) 
m 


b^(x){f^(x) 


e,) 


\ 


b„(x){f„(x) - e„) 


and let < • , • > denote the inner product induced on by scalar 

multiplication and < * , ' i=l, m. 

Since 

E 


1 ^ ^i^\^ ) 


= ’e.«l = I 


have that 
E{Vi'^{Y°)} = 


'diag o\ r 

0 ijj'' <''• ■> P ■ 
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We can denote this as I - cQR where 


'diag a 


Q = 


1 0 
0 I 


and 


R = y'v <V, ■> P • By assumption 2 we have that QR is 


positive definite with respect to <• , Q"^ •> . The theorem will be 


proved if it can be shown that for 

r. 




'm 


e R^ 


that < W, Q'^[QR]W > = <W,RW > < < W, Q’''w > . 

By an application of Swartzes inequality and the fact that 

(Vq6) 2 = f (f.9)(f-e)'^Pp 

we have the following. 

<W,RW > = /<W,V > <V,W > p 


■/[ 


.0. Pi 


m V y.p. 

Z 1-1—!- + <2 , (f -9'') — >. 
.^1 I p ^ i’ ' 1 p ^1 


^/i 


(f, - e?) > !' " 

= 1 ( 0 aV ' 1 ) p 

Ot 1 

1 * 
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= <W, Q“''w> . 

This completes the proof. 

We now consider a useful generalization of this theorem. Consider the 

case that the random variable X is a mixture of densities p^. , i=l 

m+k for k>o, where each p^ is from some exponential family for i=l,...m 
and where p^ is an arbitrary but completely determined density for 
i = m+1,..., m+k. The appropriate likelihood equations are 



N 

Z 

k=l 


CjP((X|^) 



i = 1 


m+k 


N ^.,1 P(X|() 


1 P.(x.) 


Let be the appropriate operator determined by these likelihood 
equations. It can be seen that the proof of Theorem 1 can be easily 
extended to prove the following theorem. 
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Theorem 2 Let assumption 1 be satisfied for i=l,...,m and suppose 

t ^ 

that whenever S is a proper subspace of R , t = m+k+E n., then 

i=l’ 



It follows that with probability one as N approaches infinity, is a 
locally contractive mapping (in some norm) about the strongly consistent 
maximum likelihood estimate whenver o < c < 2. 



5. DISCUSSION 


We observe that Theorem 1 is sufficiently general to include most 
exponential families and almost arbitrary mixtures between such families. 

In fact, it covers mixtures between families where the associated measures 
are not equivalent. Theorem 1 also applies to many situations where some 
subset of the usual parameters are known or where the parameters are 
constrained. 

It should also be pointed out that although Theorem 1 applies to 
mixtures of multivariate normals, it is not based on the traditional likeli- 
hood equations. Instead of iterating on the covariances, the procedure up- 
dates the non-central second moment. This results in a different iterative 
procedure, whose difference is more than cosmetic. The difference in the 
updated covariances is given by (y.j - y.j ) (u - y.,-) where is the new 
estimate for the mean given y.^ . However, there seems to be no practical 
difference between the two schemes and one has to favor the Peters and Walke 
scheme since it involves the covariances directly 

Finally, we observe that the remarks made by Peters and Walker (1978) 
concerning the optimal choice of e are applicable to this paper and the 
reader is referred to their paper for a discussion of this. 
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SPATIAL CORRELATION IN LANDSAT 


AN EMPIRICAL STUDY 

William A. Coberly 
The University of Tulsa 


1 . INTRODUCTION 

Data analysts who have worked with LANDSAT data have 
observed that neighboring pixels are not independent mea- 
surements on disjoint areas of the target scene. This 
spatial correlation or dependency is induced by a number 
of factors - overlap of the instantaneous field of view 
(IFOV ) , atmospheric scattering, optical and electro-mechan- 
ical components of the sensor system. These factors are 
are in addition to any intrinsic spatial correlation which 
might exist in the target scene. This spatial correlation 
violates a number of assumptions usually made in the digital 
processing and analysis of LANDSAT data, especially the 
statistical analysis. A few studies (1,2 ) have inves- 
tigated Its effects on the accuracy of various statistical 
procedures. However, a more fundamental analysis of spatial 
correlation rs required in order to enhance our understandin 
of LANDSAT image representation and modelling. In partic- 
ular, a better understanding of the boundary or mixed pixel 
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phenomenon requires the incorporation of spatial correla- 
tion into the model. 

Two approaches should be undertaken. First an analyt- 
ical determination of the spatial correlation induced by 
the atmosphere and the sensor system, based on a linear 
system representation of these factors should be made. 

The second approach is an empirical determination of the 
spatial correlation structure. This is the purpose of this 
exploratory study. 


2 . SPATIAL CORRELATION 

A complete study should consider the two dimensional 
properties of spatial correlation. However, in this study 
only the one dimensional characteristics, in the direction 
of the scan line, will be studied. This is a reasonable 
start since a number of the factors, such as detector re- 
sponse and electronic amplification and recording, are one 
dimensional . 

Define X^, • • • , to be the random digital 

measurements along one scan line for a single channel of '■ 
the multispectral scanner. Let m^ = E ( ) be the mean 

value of X for i = 1, ,L. Then the autocovarrance 

function is given by 
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Y( i,i+k ) = E(( )( )). 

We now impose the assumption of covariance stationarity , 
which may not hold for large scan angles, but should be a 
reasonable assumption for small scan angles. Now y de- 
pends only on the lag k, and is independent of scan line 
position 1 . That is, 

Y( k ) = Y( i/i+k ) . 

That is, we are assuming that the distribution of the 
pixels along a scan line is covariance stationary, changing 
only in mean. Note that y(0) is the variance and the 
autocorrelation (spatial correlation) is given by 

p(k) = Y(k)/Y(0) 

for k = 0 , 1 , • • • . 

3. ESTIMATION OF THE MEAN 

The mean function m^ is, of course, in general not 
known. However, for the segments used in this study, digital 
ground truth was available and this suggests a way to esti- 
mate the mean for each of the pixels. The digital ground 
truth IS tabulated at the subpixel level, six subpixels per 
pixel according to the following scheme. 
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Scan Line 

»>- 


If the pixel has the same ground truth label assigned to 

/ 

each of the six subpixels, then it is said to be "pure". 

A "field" IS an interval along a scan line of pure pixels 
with the same ground truth label. A "field" may be one 
pixel in width or many. Pixels which are not "pure", that 
IS, those containing conflicting subpixel ground truth 
labels, will be called "boundary" pixels. 

The estimate of the mean function for a scan line is 
defined as follows: 


/s 



field mean of if 

contained in a field 


a moving average if X 
IS a boundary pixel 


V. 


The moving average used is 




)/ 8 . 




5 


In Figures 1-8, the pixels are plotted (solid lines) 

superimposed on the estimated mean function m^ (dotted 
line) for the four LANDSAT channels and the two tassel-cap 
coordinates "brightness” and "greenness". One scan line 
for two acquisitions of each of four segments is presented. 


Pixels 


Estimated Mean 




100 00 


125 00 


150 00 


175 o: 


o: 


and estimated mean plot for 
5, line 62. (a)-(d) channels 

:ness, (f) green coordinate. 





Figure 1. Continued 
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Figure 2. Pixel radiance and estimated mean plot for 

segment 1618/235, line 62. (a)-(d) channels 

1-4, (e) brightness, (f) green coordinate. 
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Figure 2. Continued 
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DO 25 00 50 00 •75.00 IDD DO 125 00 ISO Du 175 DO 2" C2 



(c) 


u 


V 


00 



Figure 3. Pixel radiance and estimated mean plot for 

segment 1633/129, line 62. (a) -(d) channels 

1-4, (e) brightness, (f) green coordinate. 
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Figure 3. Continued. 
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Figure 4. Pixel radiance and estimated mean plot for 

segment 1633/236, line 62. (a)-(d) channels 

1-4, (e) brightness, (f) green coordinate. 
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Figure 4. Continued. 
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Figure 5. Pixel radiance and estimated mean plot for 

segment 1642/145, line H . (a)-(d) channels 

1-4, (e) brightness, (f) gr-een coordinate. 
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Figure 

5. Continued. 
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Figure 6. Pixel radiance and estimated mean plot for 

segment 1642/236# line 11. (a)-(d) channels 

1-4, (e) brightness, (f) green coordinate. 
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Figure 7. 


Pixel radiance and estimated mean plot for 
segment 1645/145, line 62- (a)-(d) channels 

1-4, (e) brightness, (f) green coordinate. 







Figure 7. Continued 
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Figure 8. Pixel radiance and estiinated mean plot for 

segment 1645/236, line 62. (a)-(d) channels 

1-4, (e) brightness, (f) green coordinate. 
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Figure 8. Continued. 
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4. ESTIMATING THE SPATIAL CORRELATION 


For a given scan line and channel, the sample spatial 
correlation is calculated by 


Y( k ) 


^ L"k ^ 

L , I < > ' * 

1=1 


i+k'^’i+k 


) 


and 


p(k) = Y(k)/Y(0) 

for k = 0,1, ••• . In this study the sample spatial 
correlation was calculated for every third scan line for 
each of the four channels on each segment acquisition. In 
Table 1 the average spatial correlation function over all 
scan lines used in the calculations is tabulated for two 
acquisitions for each of four segments. Although the coef- 
ficients are not the same from segment to segment, the pat- 
tern IS very consistent. The lag 1 correlation is distinctly 
non- zero over all segments and channels and the lag 3 and 
larger order correlations are essentially zero. The lag 2 
correlation is zero for some cases and non-zero for others. 

In Figures 9-16, the histograms of the estimates for 

/V /V ^ 

p(l) and p(2) and the scatter plots of p(l) versus p(2) 
are presented for all scan lines processed in the study. 
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5. BOUNDARY PIXELS AND SPATIAL CORRELATION 

The spatial correlation observed has considerable 
implications in the characterization of boundary or 
mixed pixels. The usual notion of mixed pixel is one in 
which the instantaneous field of view intersects at least 
two real label classes in the target scene. In fact, 
spatial correlation may induce the mixed pixel effect even 
when the IFOV target is composed of a single class, due to 
the mixing of neighboring pixels by the correlating mech- 
anism. By understanding this spatial correlation phenom- 
enon, better automatic boundary finding or field finding 
algorithms, specifically developed for LANDSAT data appli- 
cations, should result. 
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TABLE 1. Estimated spatial correlation functions. 


Lag 


Segment 

Chan 

1 

2 

3 

4 

5 

6 

1618/145 

1 

. 22? 

-.036 

-.030 

-.039 

-.051 

-.056 


2 

. 3C2 

-.051 

-. 048 

-.058 

-.080 

-. 072 


3 

. 352 

-. 060 

-.100 

-.087 

-.078 

-. 081 


4 

. 3C6 

-.053 

-.092 

-.087 

-.089 

-.081 


1G18/235 1 

.445 

.114 

-. 004 

-.075 

-. 088 

-.08 9 

2 

. 501 

.129 

-.008 

-.077 

-.114 

-.118 

•5 

. 486 

.097 

-.03^ 

-.081 

-. 096 

-. 090 

4 

. 486 

. 091 

-. 029 

-.064 

-. 079 

-.08 9 


1633/129 1 

. 309 

-. 015 

-. 029 

-.044 

-. 046 

-. 0<^5 

2 

. 378 

-. 005 

-.040 

-.039 

-. 054 

-. 065 

3 

. 387 

.017 

-. 019 

-. 048 

-. 078 

-. 094 

4 

. 421 

.046 

-. 007 

-. 035 

-.071 

-. 083 


1633/236 

1 

. 335 

.032 

-. 023 

-.042 

-.048 

-. 0<^1 


2 

. 446 

.058 

-.042 

-.072 

-. 090 

-. 08 3 


3 

. 396 

.035 

-. 034 

-. 068 

-.090 

-. 087 


4 

.432 

.051 

-.031 

-.072 

-.083 

-.07/ 
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TABLE 1. Continued. 


Lag 


Segment 


1 

2 

3 

4 

5 

6 

1 642/1-15 

1 

.213 

-.057 

-. 060 

-. 050 

-. 064 

-. 045 


2 

.337 

-. C 24 

-. 055 

-.066 

-. 070 

-.072 


3 

.365 

. 003 

-.023 

-.054 

-.077 

-.07? 


4 

.393 

. 007 

-. 029 

-. 0/'6 

-.062 

-.069 


16- J 2/236 

1 

.219 

-.033 

-.031 

-. 044 

-.059 

-.0^2 


2 

.310 

-. 019 

-. 053 

-. 070 

-. C 8 E 

-. C"1 


3 

.354 

.014 

-. 050 

-. 079 

-. no 

-.109 


4 

.406 

. 015 

-. 0^6 

-.076 

-.316 

-.11? 


1 64 5/14 5 

1 

.109 

-.066 

-.011 

-. on 

-. 021 

-. 015 


2 

. 178 

-.108 

-. 047 

-.023 

-. 015 

-.009 


3 

, 260 

-. 040 

-.034 

-. 035 

-.039 

-.051 


4 

.293 

-. 027 

-.029 

-. 024 

-. 021 

o 

• 

1 


1 6^5/236 

1 

.343 

-. 005 

-. 045 

-. 060 

-. 070 

-. 073 


2 

.424 

. 013 

-. 071 

-. 087 

-.083 

-.097 


3 

.426 

. 023 

-. 049 

-.061 

-.076 

-.091 


4 

.441 

.033 

-.043 

-. 058 

-.068 

-.081 
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Figure 9. Histograms for 1618/145. (a)-(d) Lag 1 spatial 

correlations for channels 1-4. (e)-(h) Lag 2 

spatial correlations for channels 1-4. Computed 
for every third scan line. 
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Fiqure 10- Scatter plot=; of I.aq 1 spatial correlation versus Laq 
spatial correlation for 161B/145. (a)-(d) channels 1 

Computed for ('very third scan line. 
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Figure 11. Histograms for 1618/23'3. (a)-(d) Lag 1 spatial 

correlations for channels 1-4. (e)-(h) Lag 2 

spatjal correlations for channels 1-4. Computed 
for every third scan lino. 
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Figure 13. Histograms for 1633/129. (a)-(d) Lag 1 spatial 

correlations for channels 1-4. (e)-(h) Lag 2 

spatial correlations for channels 1-4. Computed 
for every third scan line. 
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Fiqure 14. Scatter plots of T.aq 1 spatial correlation versus Lag 2 
sfjatial corrolation for 1633/129. (a)-(d) channels 1-4. 

Computed for every third scan line. 
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Figure 15. Histograms for 1633/236. (a) -(d) Lag 1 spatial 

correlations for channels 1-4. (e)-(h) Lag 2 

spatial correlations for channels 1-4. Computed 
for every third scan line. 





34 



• .7n ,;o ,»n «.fu ,?u ,»v 

Figure Ifi. Scatter plots of Lag 1 spatial correlation versus Lag 2 
spatial correlation for 1633/236- (a)-(d) channels 1-4. 

Computed for every third scan line. 
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Figure 17. Histograms for 1642/145. (a)-(d) Lag 1 spatial 

correlations for channels 1-4. (e)-(h) Lag 2 

spatial correlations for channels 1-4. Computed 
for every third scan line. 
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Scatter plots of Laq I spatial correlation versus Lag 
spatial correlation for 16'12/145. (a)-(d) channels 1 
Computed for every third scan line. 
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Figure 19. Histograms for 1642/236. (a)-(d) Lag 1 spatial 

correlations for channels 1-4. (e)-(h) Lag 2 

spatial correlations for channels 1-4. Computed 
for every third scan line. 
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Figure 20. Scatter plots of I.ag 1 spatial correlation versus Lag 
sfiatial correlation for JG42/23G. (a)-(d) channels 1 

Computed for every third scan line. 
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Figure 21. Histograms for 1645/145. (a)-(d) Lag 1 spatial 

correlations for channels 1-4. (e)-(h) Lag 2 

spatial correlations for channels 1-4. Computed 
for every third scan line. 
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Figure 22- Scatter plots of Lag 1 spatial correlation versus Lag 
spatial correlation for 1645/145. (a)-(d) channels 1 

Computed for every third scan line. 
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Figure 23. Histograms for 1645/236. (a) -(d) Lag 1 spatial 

correlations for channels 1-4. (e)-(h) Lag 2 

spatial correlations for channels 1-4. Computed 
for every third scan line. 
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INFORMATION IN REMOTELY SENSED DATA FOR ESTIMATING 
PROPORTION IN MIXTURE DENSITIES^ 

Virgil R. Marco, Jr., and Patrick L. Odell 
University of Texas at Dallas 
Box 688, Richardson, Texas, 75080 

I. INTRODUCTION 

Data taken remotely by multichannel sensors on a near earth satellite 
can be modeled as a collection of multivariate data points. In the 
application [1] that motivates this paper each pxl data vector repre- 
sents a measure of reflectance from (1.1) acre location on the surface of 
the earth. Each of the p elements of the data vector is a reflectance 
measure at a preassigned wave length of light. Conceptually, each crop 
class defines a set of reflectance measures that can be modeled by a 
multivariate unimodel probability density function unique for each crop 
class. 

Let there be m-crop classes and let the p.d.f. 

P^(x) — p^(xjp^,I^^) i — 1 , . . . ,m (l.l) 

denote the distribution of the random data vector X given that the 
measurements were made on the i^” crop class, n^, i = l,...,m . Also 
let the multivariate mixture p.d.f. 


Vhis research was supported in part by the National Aeronautics and 
Space Agency, Johnson Space Center under Contract NAS 9-14689-95. 
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m 


P(x) = I a.p.(x) 
i=l ^ ^ 


( 1 . 2 ) 


m 


such that a. 0 i = and I ot. = 1 denote the distribu- 

^ i=l ’ 

tion of the multivariate observations given that the data is unlabeleld , 
that is modeled by p(x) in (1.2). 

Definition 1 . A random sample is said to be unlabeled if the random 
vectors are selected from a population defined by (1.2). 

Definition 2 . A random sample of unlabeled data is said to be classi- 
fied data if, according to some classification rule R = (R-| .R 2 *...*R^). 
each vector in the sample is assigned to one of the (crop) classes 

^1 * ^2 * * * * * 

Definition 3. A random sample of unlabeled data is said to be verified 
data if each vector is classified as being from the true subclass n. 
for some i = 1,2,..., or with probability one. 

Verified data is classified data in which there is zero probability 
of misclassification. 

Definition 4 . A random sample is said to be labeled if it is selected 
from a single class n^. and the identity of i^*^ population is known. 

The difference between verified and labeled data is that the verified 
data must be labeled a posteriori while the labeled data is labeled prior 
to taking the sample. In both types of samples, one knows with certainty 
the label of the population from which the samples came. 

The purpose is to estimate the vector or proportions a = 

(a^ ,a 2 ^. . . »«p)^ which defines the function p(x) in (1.2). If 



3 


denotes the proportion of vectors in the mixture from class and N 
the total number of vectors in the region, then 

A. = (1.1) N a. (1.3) 

is an estimate of acreage of crop class , as a function of an estimate 
of the proportion a^. and a^. . Hence, our interest is to estimate 
well. 

Three different types of data are available for estimating the 
elements of a arise naturally in the application involving remote 
sensing from space. They all are maximum likelihood estimators for a 
using 

(a) unlabeled data, 

(b) classified data, or 

(c) verified data, respectively. 

The cost of acquiring unlabeled data is less than the cost of acquiring 
classified data which is in turn less than the cost of acquiring verified 
data. The computation of sample size allocations when samples from more 
than one type of data are available arises naturally. In the case of 
sample design one can control the type of data to be selected and the 
optimal mix of sampling can be accomplished. It is important to note 
that one always has available a random sample of unlabeled data; hence 
if Cy denotes the cost per unit of taking unlabeled data then 


c 

= C 

+ 

c. = K C 

V 

u 


V V u 


= C 

+ 

C. = K C. 

c 

u 


c c u 
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are the per unit cost where and are the costs of classifying 
and verifying in unlabeled data point respectively. The values Ky and 
Kq are multiplicative constants that give in addition to an additive 
model a second multiplicative representation of the costs. 

One would expect in most space science applications. 

It is important to note that in the space application unlabeled data is 
available as basic for two of the three methodologies for estimating a , 
and except for missing data that the totality of unlabeled data is 
also available. The cost of machine processing every vector is a 
realistic limiting factor for unlabeled and classified data while the 
cost of resources to visit each location for verification is the major 
limiting factor for obtaining verified data. 

However, it is not intuitively clear which type of data contains 
greatest amount of information for estimating a for a fixed sample 
size. The purpose of this paper is to compute and order with respect 
to magnitude the information content of the three types of data, and 
discuss the implications of that ordering for the space application. 

The term information content of the data is defined as the inverse 
of the Cramer-Rao matrix lower bound for unbiased estimators for a . 

This is the matrix form of Fisher's Information. 
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II. INFORMATION CONTENT OF VARIOUS TYPES OF DATA 

2.1 Fisher's Information: Let X denote a random observation from a 

multivariate (p-variate) population whose p.d.f. is defined by (1.2). 

If we denote the parameter vector by a = (a-j then by the 

usual theory (Cramer [2], Rao [3]) the (m-1 x l) random vector 

S = (2.1.1) 

is such that 
E[S] = 
and 

E[S sT] . - E . . E j A(a) (2.1.3) 

where A(a) denotes Fisher's information for a contained in the 
sample X . 

If X^,...,X^ denote a random sample from a multivariate population 
whose p.d.f. is defined by (1.2), then the Fisher's information for a 
contained in this sample can be shown to be 

E[S S^] = n A(a) . * (2.1.3) 

Furthermore, A”^(a) is the Cramer-Rao lower covariance matrix bound 
for unbiased estimators of the vector a . That is, 
if a is any unbiased estimator for a , then the covariance matrix A(a) 
will never be less than A"^(a) . Note that if A and B are two positive 
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definite matrices of the same size and A - B is positive semi-definite 
then we say B is less or equal to (when A - B = $) than A . 

From (1.2) it follows that 


m-1 / m-1 \ 

P(x) = I a. p.(x) +(l - I a. I p (x) (2.1.4a) 

j=l ^ ^ \ j=l y 

m-1 

= I aj[Pj(x)-p^(x)] + p^{x) . (2.1.4b) 


It follows from (2.1.1) that 


^3 m 


I a • 
j=l J 


Pj(x) 


pT^j 


and 


(2.1.5) 


9Sj [Pj(x)-p^(x)][p|^(x)-pjx)3 
’ ' [p(x)]2 

Therefore, the information for a is given by 



(m-1 )x(m-l) 


( 2 . 1 . 6 ) 


(2.1.7) 


Fisher's information can be seen as the information contained in a 
random variable X about the parameter a . This should be interpreted 
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as the extent to which, on the average, the accuracy of estimating the un- 
known parameter a can be increased as a result of the observed value x 
of the random variable X . 

In the ensuing sections of this paper, information for a con- 
tained in unlabeled, classified and verified data, defined earlier will 
be ordered. 

Above, information is defined in terms of unbiased estimators. 

2.2 Likelihood Function . If X^,X 2 ,...,X^ denotes a simple random sample 
from p(x) defined by (1.2) then the likelihood function is 

Lu(Xi,...,Xn) = p(X.) (2.2.1a) 


n 

= n 
£=1 



“o 


Pj(X 




(2.2.1b) 


the likelihood function for unlabeled data. 

Let X^,X 2 *...,X^ denote a simple random sample from p(x) which 
has been classified according to a rule R = (R^ ,R 2 ». • • »R^) » then each 
data vector Xj^ , k = l,2,...,n generates through classification new 
data defined by the random variable X.(X|^), i = l,2,...,m , where 

Y.(X^) =1 if X^ e R. (2.2.2) 

= 0 if X^ ^ R,^ 

whose joint p.d.f. is for each X|^ a multinomial 



8 


m yi(X|^) 

^Y....Y “ .!!, 9,- (2.2.3) 


where 


g . = Pr[X|^ E Rj ] 

= / p(x)dx 
"1 
m 

= I “i / Pi(x)dx 
j=l ^ ^ 

m 

= Z a,p(i ,i) , 
j=l 


the probability of classifying I(X,^) in n. , 

The likelihood function for classified data follows from (2.2.3), 
and is 


= L(Y^(X^),...,Y^(X^);...;Y^(X^),...,Y^(X^)) 


n m Y^-(X,^) 


= n n g. 
k=l i=l ^ 


n m f m I Y- 

= n n i a.P(i|j) 

k=l i = l U=^ ^ J 

m r m 1 ^i 


(X,) 


where 


^ k=l ^ 


(2.2.4) 


(2.2.5) 
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the number of sample vectors in R. . 

1 

Let Ii(X-| ) ^n^^n^ denote a random sample whose labels 

are known with probability one, that is, the data has been verified, then 



if 


= 0 

if 



then the p.d.f. of T = (T^»...,T^)^ for each I|^ is 


m t.(I.) 

'l**“’m i=l ^ 


The likelihood function of a verified sample is 


(2.2.6a) 


(2.2.6b) 


Lv = L,(Tidi),...,TjI^);...;T^(g Tjg) 


n m T.(I.) 
= n n [a.] ^ 
k=l i=l ^ 


m n. 

= n [a.] ^ (2.2.7) 

i=l ^ 

where 

n. = I T.(I.) . (2.2.8) 

^ k=l ^ 

the number of individuals in the sample from n^. . 

2.3 Information for a Contained in Unlabeled Data . 

Let the following denote the information for a contained in 


unlabeled data: X, ,...,X : 

I n 
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(m- 1 )x(m-]) 

Using ( 2 . 1 , 2 ), (z p ih) 

and synthetic division, i 


for 


^ = j 


It can be shown that 


iv. 


^ ~ )B. - 

1 fn' i/n 


oi_ 


m 


m-] 


la.+cTJ I 
1 m' k =1 

Mi 


a. 


m=] 

TaTf^rr ^ «,-B. 

1 j=] J jm 

J7i 


and for i ^ j 


a" 

IJ 


_ 1 


ct. 


m 


1 - 


(a.+a )B. - 

^ fn' im 


tn-i 


where 


k=l ni m ij 


< B. . f Pf(=<)P,{x) 

■ 'J / ~PRT 
•'«P 


dx < 1 


^jk " ®kj ’ j ?« k . 


When B . . = B 
ij ’ 


(2.3.1a) 


(2.3.1b) 


(2.3.1c) 


Ay(a) = n(l-B){A“.} 
^ J 


where 


(2.3.2a) 
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10 

1 tn 


for i = j 


(2.3.2b) 


= J- for i j 
m 


(2.3.2c) 


When m = 3 , the p.d.f. of a random variable X from a mixture 
population (unlabeled data) is 


p(x) = a^p^(x) + tt 2 P 2 (x) + « 3 P 3 (x) 


(2.3.3a) 


where 


^1 0^2 ^ *^3 ” ^ (2.3.3b) 

and 

> 0 , tt 2 > 0 , 03 > 0 . (2.3.3c) 

It follows from (2.3.1a) - (2.3.1c) that the information contained in 
unlabeled data is given by 
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A (a) = 
u 


A, 


21 


.u T 

^2 


A, 


22 


11 


72 


(l-a2) 
«2“3 
(l-a-j ) 
“ 2‘^3 


«2“3 


“ l '^2 


1 - - (l-ao)B,, - u.o 

1-tto 12 ' 2' 13 1-ao 23 


1 - 


“l“3 


“ 1“2 


T*^ B,, - B.„ - (l-ajB 

1 -a-| 12 1 -a-] 13 1 


23 


(2.3.4a) 


(2.3.4b) 


where 



12 


= — 

a^2 /' 2 I 


“3 




(2.3.4c) 


Note that one minus (2.3.1c) can be regarded as a distance measure. 

That is, when the i^*^ and populations are "close together" or "far 

apart" then (1-B..) will be small or large, respectively. In fact, 

several investigators [3], [5], [6] , have employed a form of (2.3.1c) 

as a probabilistic distance measure for feature selection. While Cover 

and Hart [8] have shown that 2a. a. B.. corresponds to the asymptotic 

1 J ^ J 

nearest neighbor probability of error, this motivates a possible 
estimating procedure (see section 4. ) using a nearest neighbor procedure. 

It is of interest to consider the behavior of B.. in terms of a 

* sJ 

popular distance measure as the distance between the i^*^ and popula- 
tions diverges. This behavior is described in Leimia 2.3.1. 

t h h 

Lemma 2.3.1 ; Let the distance measure between the i and populations 
be given by 


f 

P^-(x) 

1 [p.j(x) - Pj(x)] log 



dx . 


(2.3.5) 


If A . . for all i ^ j , then B . . ->-0 . 
1 j 1 j 


Proof: Toussant [4] has shown that 


0 < 


B. . 
U 




1 


Note that as 




<» then 


0 • 
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Note that (2.3.5) is known as the divergence between two distributions. 
For normal distributions with equal covariances, (2.3.5) reduces to 
the, well known Malhanabis distance. 

The following example can clarify some of the concepts introduced 
above : 


Example 2.3.1 ; 



X , 0<x<l j 

x-1, l<x<2 1 

f 

x-2, 2<x<3 

P^(x) = . 

2-x, l<x<2 , P 2 (x) = 1 

3-x, 2<x<3 , P 3 (x) = 1 

4-x, 3<x<4 


0 , o.w. 1 

0 , o.w. 1 

0 , o.w. 


p(x) = oL|P^(x) + a2P2^^) * * 


Let 


“l ° * “3 ' I 


then 


B 


12 


4 



P^(x)p2(x) 

■p(x) 

2 

3(2-x)(x-l)dx 

(3x-2-x^)dx 


fixKxiIL dx 
^2-x+x-l ) 



= 3 



3 

(3-x)(x-2)dx 


1 

7 




, ,i i i, 
^^ 3 * 3 ’ 3 ^ 


5 

7 


5 

7 

7 

2 



To conclude this section, a result that follows from Letrnia 2.3,1 


is given. 

Theorem 2.3.1 : Let A., be a distance measure defined 

* J 


If A. . -► «> for all 

i 3 ^ j then 

A^(a) ^ Ay(a) = 

n{Ay.} 

ij 

where 


1 1 m 

\ “-{“m 

V ) 1 n» 

Nr L 

» “m 

for i = j 

for i ^ 3 


Proof; Using equations (2.3.1a) - (2.3.1c) and letting 
Theorem follows from Lemma 2.3.1. 

Note that (2.3.2a) can be written as 

A^^(a) = n(l-B)Ay(a) . 

The information matrix Ay(a) is the information for a 
verified data. This is a topic of the next section. 

2.4 Information for a Contained in Verified Data 

Let be defined as in (2.2.6a). It follows 


by (2.3.5). 


^ij ** * 


(2.3.6) 
contained in 


from (2.2.7) 
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since I a. = 1 . 
j=l J 


From (2.1.1) then S. = 

j da. 

3 £n Ly V 

Sv = -1^ = 


it follows that 


where 




(2.4.2) 


n . n 

= — — j j “ l,...,m-l . 

“j “n 


In matrix notation 


Sv = V 


(2.4.3) 


where the (m-l)xm matrix is given by 


A = 
a 


0 0 ., 


a 


1 “0 ., 


a 


0 0 
0 0 


a_ 


m 


a 


m 


0 0 0 ... 0 


0 

1 


a. 


m-1 


ot_ 


m 


(2.4.4) 


and 


n = (n^,...,n^) 



16 


Note that by the Cramer-Rao theory the expected value of S is the 
zero vector which we will verify directly. 


E[Sy] = E[A^?n 
= A E[rT] 

= n A a since n. ~ multinomial (n,a-) for j = l,...,m . 

'*• w J 

Now, 


A a = 
a 


— 0 0 ... 0 

“1 

0 - 1-0 ... 0 

oil 




• • • 




(2.4.5) 


Thus, 


E[Sy] = $ . ' (2.4.6) 

The information matrix for a when sampling from verified data 
can now be computed by finding the covariance matrix V(Sy) of Sy 
using (2.4.3) and (2.4.6), that is, 

Av(o) = V(S) 

= m 


(2.4.7) 



a 
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““ T 

where V(rT) is the covariance matrix of the n = (n^,...,n|^) , 

multinomial vector variate; that is, 

V(n) = n[Diag(a-| . ,a^) - ota^] . 

From (2.4.7), (2.4.8) and (2.4.5), 

Ay(a) = A^[Diag(a^ ,. . . ,a^) - aa^] 

- A^[Di ag (a-j , . . . ,01^) ] A^ . 

For exemplary purposes consider the case when m = 3 , then since 


(2.4.8) 


(2.4.9) 




/V!3 j_\ 

“l“3 *^3 

I ^ 

\ “3 °‘ 2°‘3 / 


9 


Suppose we are given an unlabeled sample 


(2.4.10) 


X 


1 


9 • 


,X 


n * 


Then we verify this sample generating the sample 
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T 


1 


,T^ , where T^. = (T^.^ 



For estimating should we disregard the unlabeled sample or consider 
the joint sample (X^. ,T^. ) , i = l,...,n? The joint p.d.f. of 

(XjjT^) j i — l)»>»jn IS 

p(x^.t^) = p(x^. |t^.)p(t^.) . t^. = (t^p...,t^jj^) 


m 


n 

j=i 


[Pj(Xi>] 



m 

n [a,] 

j=1 ^ 



m t. 

= n [a. p.(x.)] 

j=l J ^ ’ 


(2.4.11) 


To answer the above question consider the following theorem. 

Theorem 2.4.1 : The amount of information for a contained in the obser- 
vation (x^. ,t^. ) is equal to the information for a contained in the 
observation t^. alone. 

Proof: Taking the logs of both sides of the equality in (2.4.11), we see that 

m m 


In p (x. ,t.) = I t.. £n p.(x. ) + I t.. In a. 

j = l J * j — I * V J 


Now taking derivative with respect to aj we have 

m 

3 In p(x.,t.) ^ -1. ^ij ®j 9 In p(t.) 

! ! — = 0 + — - — = — 

9 a. 9 a. 9 a. 

«J \J J 

Therefore, 


9^£np(x^. ,t^.) 

= - E 

9^£np(t.) 

^ 2 

9 a. 

L ' “j J 


J 
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Thus, it follows from Theorem 2.4.1 that for estimating a the 
joint sample (X.,T^.)» i = l,...»n contains no more information than 
the sample T^...,T^ alone. 

2.5 Information for a Contained In Classified Data . 

Using the likelihood function given in (2.2.4) for a random sample 
defined in (2.2.2), it follows that 
m 

£n L = I N. £n g. 
c i=l 1 1 

m-1 / m-1 \ 

= 3, N,j£n 


m-1 

1 - I 

i=l ^ 


since 

m 

i 9i = 1 • 
i=l ^ 

m 

Also, from (1.3.6) and I 1 that 

i=l ^ 


^i ^ oij.[p(i|j) - P(ilm)] + P(ilm) 


(2.5.1) 


and 


= P(i|j) - P(i|m) . 

3 

c 9 ^ 

From (2.1.1) and ^ — it follows that 


(2.5.2) 


Sj = j N, 


(2.5.3) 


or in matrix notation 
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Sj, = [A.jf G“^ N (2.5.4) 

where the (i;i-l)xni matrix [A..]^ is defined by its elements 

A*.j = P(i|j) - P(i|m) . (2.5.5) 

(2.5.6) 
and 

N = (N^.N2....,N^)^ . (2.5.7) 

Note that by the Cramer-Rao theory the expected value of S is 
the zero vector which we will verify directly. 

E[S^] = E[A*.j.f N 

= [A*.jf G-^ E[n1 

= G“^Ng) , (2.5.8) 

where 

9 “ (o^j i92»* • ' *9^) 
or 

g = GJ (2.5.9) 


0 0 
0 92 0 


. . 0 

. . 0 


G = 


’m 


where 



21 


J = 0.1.. ...1)^ . 

I 

It follows from 
m 

I P(i|j) = 1 
i=l 

! I 

for j = l,2,,..,m that 

= $ (2.5.10) 

* \J 

and in turn from (2.5.8) and (2.5.9) that 

E[S_] = N[A*. J G"^ GJ = $ . (2.5.11) 

C Ij 

The covariance matrix V(S ) of S can now be computed using 
(2.5.4) and (2.5.11), that is 

V(S^) = 6"^ V(N) 6"‘‘[A*.j] (2.5.12) 

where V(n) is the covariance matrix of the FT = (N^ .N^.. • • .N^^) . a 
multinomial vector variate, that is 

V(N) = N[G-GJJ^ G] (2.5.13) 

= NG(I-JJ^G) 

= N[G-Pcxa''^P] 

where 
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From (2.5.10), (2.5.12), and (2.5.13) 

A^.(a) = V(S^,) = N[A*..f , (2.5.14) 

the information for a contained in classified data. 

For completeness we state the 'following theorem. 

Theorem 2.5.1 : 

Aj.(a) -*■ A^(a) as P -*• I 

where 

P = (P(i|j)} . 

Proof; In matrix notation, 
g = Pa . 

Let P -> I , then g a and 

1 for i = j ^ m 
-1 for i = m 
0 o.w. 

that is. 
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Note that (2.4.9) can be written as 

\<“) = 

VVi / 

where I , is a (m-l)x(m-l) identity matrix and 
m- 1 

"Vi " • 

m - 1 

Thus, 

A,(a) G-’[A*,.] [A'jf[diag( i ... i )]Kj] 

as P I . 

For exemplory purposes consider the case when m = 2 , then 
[A.jf = [P(1|D - P(l|2) , P(2|l) - P(2|2)] . 

9l 0 
0 92 

9-j “ 1 " 92 * 

P(l|l) = 1 - P(2[l) and 



(2.5.15) 


Ay(a) 


since 


P(2j2) = 1 - P(l|2) , 
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then 


A^1 (a) 


= Nrp(in)-p(ii2)r 


9192 


(2.5.16) 


Suppose further, that if there are no errors in classification, that is. 


P(l|l) = P(2|2) = 1 


then 


9 l = tti and 92 = cx 2 


and 




Note that for this case. A" (a) is the variance of a sufficient 
statistic ®i “ in a binomial probability density function. 
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III. THE MAIN RESULT 


3.1 The Ordering of the Information for a . 

For the two population case (m=2) , the information for 
contained in unlabeled, verified and classified data are given respectively 
by 


Au(a) = 


N(1-B) 

a-|a2 




N 

“1^ * 


where 


■I- 

rP 


P^(x)P2(x) 
P(x) 


dx 


(3.1.1a) 


(3.1.1b) 


and 


^ y-j ^2 


(3.1.1c) 


The similarity of A , A and A is striking and one notes in 

V ^ U 

this case an obvious ordering exists, that is 


Ay(a) > A^(a) (3.1.2a) 

and 

Ay > A|j(a) . (3.1.2b) 

The inequality (3.1.2a) holds since 
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N[P(in) - P(1|2)] 


[a^P(l|i) + a2P(l|2)JLl-giJ 


However, 


91 = a^P(l 1) + (l-a^)P(ll2) 

92 = 1 - 9 ] 
implies 

9,92 = a,0-a2)EP(1|l)-P(l|2))^ + ^ POIDCl-POIl)] 


+ ^PO|2)[t-PO|2)]] • 
1 

Let 

» = [pnii)-p(ii2)]^ 
c 9^92 

( 1 -a^ ) 


Since 0 < < 1 , one can conclude for m = 2 , that 


A^(a) 


(1-a^ ) 


or 


From (2.6.1a) and the fact that 
def 

0 < = 1 - B < 1 


(3.1.3) 


(3.1.4) 
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implies that (3.1.2b) holds, that is, for m=2 
Au(a) < A^(a) . 

In this section, we will establish the following ordering of the 
information for a : 

A^(a) < A^^(a) < A^{a) . (3.1.5) 

(Note that if A and B are two positive definite matrices of the 
same size and A - B is positive semi-definite then we say "B is less 
than A".) This result will be given in a corollary to a Theorem proved 
by Rao [ 3i]. 

Note that classified data defined in (2.2.2) is a explicit trans- 
formation of the unlabeled data. Knowing this, it follows directly from 
the following Theorem due to Rao [3] that 

Aj.(a) < A^^(a) . 

Theorem 3.1.1 (Rao); The matrix A^ - Ay is semi -positive definite, 

I 

where Aj is the information matrix in a measureable function T 
of X . 

The ordering betv/een Ay with A^ and A^. is not as straight- 
forward. The ordering (3.1.5) is proved in corollary 3.1.1 which will 
be proved very sim'ilarily to the proof of Theorem 3.1.1 once the 
following three lemmas are proved. 

Suppose one takes an unlabeled sample and then classifies it, then 


let 
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Z = (X^,Y(X)) , Y(X) = (Y,(X)....,YjX)) 

when Y-(X) = 1,0 if x e R. , x ^ R. respectively. 

J J J 

Lemma 3.1.1: The p.d.f. for Z is given by 

(P«(x) , if X e R. and y. = 1 for some j = l,...,m 

P,(z) = * ^ ^ 

’0 , o.w. (3.1.6) 

Proof; 

p^Cz) = p(x,y) 

= Pr(Y(x) ='y|X=x) p^{x) 

Now (3.1.6) follows from 

1 if X e Rj and = 1 for some j = 

PJY(X) = y |X=x) = 

0 o.w. 

since P^(Yj(x) = 1 and Yj^(x) = 1) = 0 for j k . 

Recall from Sections 2.3 - 2.5 that 


s„ = (S“) . 

(3.1.7a) 

s, = {S'} , 

(3.1.7b) 

Sc= (Sj) . 

(3.1.7c) 

J ~ 1 J • • • 1 



for j = 1 , . . . , 
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where 


J 


P^(x) - pjx) 
■p7x) ■ 


s '' 


T, 


“J 


m 


a_ 


m 




m Y. 

I — A.. 
• 1 9- 
1=1 ^1 


1J 


for j = . 

Furthermore, we know that 


E = E = E Sc = $ 

Lemma 3.1.2: 

(i) E[SjY=y] = S^ 


(3.1.8a) 

(3.1.8b) 

(3.1.8c) 


(3.1.9) 


(3.1.10a) 


(ii) E[SjX=x] = Sy 


(3.1.10b) 


(iii) E[SjY=y] = S^. . 


Proof: 


(3.1.10c) 


(i) For each j = l,...,m-l, it follows from (3.1.8a) that 

-f ^ ^ . 


Let 


Y - ^(k) ~ (0* • • • >1|^ »0» • • • *0) 


30 


where Ij^ indicates that = 1 • Then it follows from Lemma 3.1.1 
that 




Pj(x)-P„(x) 


p(x) 


2M 

9k 


dx = 


= ^CP(k|J)-P(k|m)] 
9k 

9k * 


(Note that 

Thus, in general we have 

E[sV|Y=y] - ^ j • 


(ii) For each j = l,...,m-l , it follows from (3.1.8b) that 


ECS*|X=x]= I f(t|x) 

{t|p(t|x)>0) 

_ E(t(i)|x) f(t(„)|x) 

J m 

where 
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Note that f(t|x) = _ 


Hence, it follows that 

V a.p.(x) 

Pi(x)-P„(x) 

' p(x) 


an,Pm(^) 

mm 

VTO~ 


Sj , for j 


= 1 , . . . ,m-2 


(iii) Suppose y = for k = l,...,m , then for j = l,...,m-l 
it follows from (3.1.8b) that 




“j 


m 


It can be easily shown as follows: 


1 <<i 


' h(y(„)|t(j))m(j)') 

'■'•(''(kr’lt(.ir') i 


P(k|j)gj 


= Q(j|k) 



E£sjiv=,,^,] = ai^-ai^ 




“j9k 

9k 

^ • 

9k 


“m 9 


m^k 


- P(klm)] 


In general, we have 

„ m y .A . j 

E(s"|Y-y) • I 

J i=1 9^ 




Sj , for j 


Lenina 3.1.3: (i) E(S S ) = A 

c u c 


<'*) 


(ill) E(S^S^) = . 


Proof: 


(i) E(S^S^^T) = E{E(S^sJ|Y=y)} 

= E{S^E(s/|Y=y)} 


It follows from Lenina (3.1.2) that 



(ii) and (iii) are similarly proved. 


Corollary 3,1,1; 

(1) a„-a^=d, 

(1i) = Dj 

(Hi) V\ ' “3 

where D^, D 2 and Dj are positive semi-definite matrices. 

Proof: 

(i) Since ES = E$ = $,, then by definition, the covapance 

^ ' U ‘ * t I ‘ 

matrix, of S^-S^ is gi.yen by, , . 


E(V5c»V\> • 

, I r ’ . r' , . 

Now, (3.1.11) can be written as 

^ c c c Tj_c e f V L'c-p'c f 


( 3.1 


' '■> * '.M ! , 


E(S s ^-s s ^+s s ^+s s ='es's ^-es s ^-es s ^+es's ^ 

‘■'''u u u c c u c c ' u u u c c c c c 


It follows from Lemma (3.1.3) that 


= A -A 
u c 


T ^ 
T 


E(Su-Sc)(Su-Sc)' = Au-"^V^c 


= A -A , since A' is symmetric, 
u c c 


Since by definition, (3.1.11) is positive semi -definite, then A -A 

U V# 

is, positive semi -definite. 

(ii) and (iii) are similarly proved. 
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4. APPLICATION AND CONCLUSIONS 


The central questions now include the following: Should one spend 

resources to verify data to gain information? Should one spend the 
allocated amount on verifying a small amount of data or process a large 
amount of unlabeled data? Is there any advantage at all to processing 
classified data. 

4.1 Concerning Classified Data 

In the space application the total data set is made up of unlabeled 
data which can be processed directly to obtain the true value of a 
or more realistically due to the magnitude of the set he sampled to 

A 

estimate a . Let g^ = 

^ y=l J 

ct 4 j = l,2,...,m, then since In general 
. m 

E[9,] • I PdldKi" a, (4.1.1) 

’ J'l J ’ 

^ ^ A A /s, 

it follows that if g = (g^ .g 2 »- •• »9^) » then g is a biased estimator 
for a . In matrix notation 


I Y../N = N^/N be an estimator 


E[g] = Pa = g 


where g = [Pr(x e R.j)] , which implies 
a = P'^g . 

Note that if one defines 

» 1 A 

= p 'g 


i 
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that E[6i] = P‘’^E(g) = P’^Pa = a and a is an unbiased estimator for 

a , when P is known . Unfortunately, the matrix P”^ is unknown; hence 

must be estimated. The sample used to estimate P ^ is called test 

data. There is bias in. the estimator a.- .P g when P is 
-1 ^ -1 

replaced by (P . ),= (P) hence a will be biased. , , 

Note also that in (4.1.1) it has been assumed that aind E. arc 

t . . t , ' ■ • • . ‘ • ■ . • ■ I. 

known when in fact they are not known but must be estimated. The 
sample for, estimating these parameters are called the training data 

I I * 

(the data to "train" a classifier). 

One must also select a classification rule. Two candidates naturally 
are candidate^. The Bayes classification procedure and the maximum 
likelihood procedure. The Bayes classifier is optimal with respect to 
minimizes the expected costs of misclassification but unfortunately is 
a function of the elements of o hence in practice cannot be used. The 
analysis and results in this paper are not dependent on the type of 
classifier used. 

In Table 4.1 the values of information for various values of a.j 
when m = 2 and n = 1 as function of type of classifers and for various 
distance between the subpopulation p-j(x) and p£(x) each assumed to 
be normal, hence p(x) = p^(x) + (l-a)p 2 (x) is a mixture of tv/o normals 

(A = P]"Vi 2 ^1 ” ^2 ” identity matrix). The symbols Ag and 

A|^Le denote the information using a Bayes classifier and the maximum 
likelihood classifier, respectively; A^ is information using verified 


data. 



Table 4.1. Approximate Values of A^, Ag, 4 Mixture of Normals 
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In Table 4.2 values of information are given for various values 

2 2 

of 4 , k and a-| when 02 = ka-j and p(x) is a mixture of two 

2 

univariate normal p.d.f. The value selected for a-j = 1 and n = 1 . 

4.2 Conclusions 

The surprising result that classified data has the least informa- 
tion is especially important since current practice in processing remote 
sensed data is to classify the unlafaeled data. It is true that it may 
be easier to classify than compute the maximum likelihood estimates for 
a using unlabeled data. Hence classifying the data would be a necessary 
task. The information in classfied data is nearly equal to but always 
less than the information in unlabeled data. 

Note also, if the expense to verify data is sufficiently small 
then the unlabeled data taken remotely from sapce is not needed. A random 
sample of locations on the earth's surface is sufficient to estimate 
a . The satellite data is of no value except to randomly select. sites 
for verification. 

If training data and test data are in reality classified data 
(that is, unlabeled data classified by photo interpreters) one can and 
should expect loss of information. However, training data and test 
data are in fact verified or labeled (ground truth with no classifica- 
tion error) one should expect better results in estimating a . 
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Table 4.2. Information ^ for Various Types of Data (v,u,c) Versus 
Values of the Parameters (k,A,a-|). 



















